Hacker News new | past | comments | ask | show | jobs | submit login
Demystifying Radix Trees: How Radix trees made blocking IPs 5000 times faster (sqreen.io)
179 points by TaikiSan on Jan 16, 2019 | hide | past | favorite | 73 comments

I gave a talk at GoSF about Radix trees, and how they are used heavily in HashiCorp products (Terraform, Consul, Vault, Nomad, etc). The slides are available here for those interested: https://speakerdeck.com/armon/radix-trees-transactions-and-m...

Radix trees are one of my favorite data structures, and widely under used. For many "dictionary" type lookups, they can be faster and more efficient than hash tables. While hash tables are commonly described as being O(1) for lookup, this ignores the need to first hash the input, which is typically an O(K) operation, where K is the length of the input string. Radix trees do lookups in O(K), without needing to first hash and have much better cache locality. They also preserve ordering allowing you to do ordered scans, get min/max values, scan by shared prefix, and more.

If you combine them with an immutable approach (such as https://github.com/hashicorp/go-immutable-radix), you can support more advanced concurrency, such as lock free reads. This is important for highly concurrent systems to allow scalable reads against shared memory.

I'll plug a shout out to your C https://github.com/armon/libart as well. :-)

Radix trees are extensively used inside Redis. The new Stream data type is completely backed by radix trees, also Redis Cluster tracks keys using a radix tree and so forth. We have an implementation that is not Redis-specific and can be used for many other applications here: https://github.com/antirez/rax.

Many thanks for that and for all your other contributions to the world of open source!

Cool radix trie trick:

Set a fixed-size binary radix trie of 32-bit IP addresses, say 1000 entries. Track the nodes of the trie in a list, LRU order; insert an IP, its node goes to the top of the list.

When you exhaust the available nodes, reclaim from the bottom of the LRU list --- but first, find either a sibling for the node already in the trie, or a parent, or a sibling of what that parent would be, and "merge" the IP address you're losing.

(So in reclaiming, merge with to make, etc).

Over time, "important" /32s --- really, important prefixes, period, not just /32s --- will "defend" their position towards the top of the LRU, while the noise will get aggregated up, into /16s, /15s, /4s, whatever.

What you're doing here is inferring prefix lengths (netmasks), which is kind of magical.

You can do the same thing with memory addresses in a debugger to infer (course-grained, but without much effort) data structures and allocation patterns. There are probably other integers you can do this with that nobody's thought of.

(The data structure is called Aguri).

By reading the OP's comment here, I think it'd be better to update the article with some description on the requirement and why you choose a linear scan initially. Without this understanding, it's really easy to say "why not hash table?".

Not certain but there are two advantages of using a trie in this case, 1) you don’t have to hash the IP address (that takes time too, and is for every packet remember), and 2) you can easily represent whole IP subnets in a trie as one node (though it doesn’t say in the article if they do this, it’s common to block a subnet in a blacklist).

> you don’t have to hash the IP address

That's not necessarily slow. Or at least not as slow as multiple potential cache misses while traversing a tree. Once you know how many addresses you're expecting to have, you can make a hash indexing into a preallocated array which will never get smaller - if it fits into a few cache lines, it's great.

Yeah, though I think the concern is that the article headline acts as if the default would ever have been linear scan. It's like saying "Microwave Ovens: How to make cooking 1,000 times faster" and then comparing to direct solar heating and cooking.

I'm pretty sure it's number 2, as the prefix search requirement is probably the reason why they went with the linear search in the first place, and didn't just use a hash map.

I mean, compared to sequentially comparing thousands of IP addresses (really???), anything would be fast.

People here dismiss the linear search, but for a small number of items (and yes, even thousands can be a small number), I wouldn't be so fast without testing it. Cache locality also has a say here, and depending on the exact circumstances, a linear search might make sense (even more so as a first version).

I think the only time it makes sense to use a linear search is if you think the number of items on your list is low enough that it doesn't even matter what data structure you are using. But I would probably use a set rather than a list anyway if only because that more accurately reflects the nature of the data.

IIRC MySQL just skips using indices and does a linear search when a table size is small enough.

Probably a lot of DB Servers do that - which is why the plan for a query can change as the amount of data per table increases.

I think with AVX2 you could compare 10000 IP addresses in maybe 500 nanoseconds. Assuming they're cached. Uncached maybe 2 microseconds.

It'd still not be the greatest idea... but don't underestimate linear scan on a modern CPU.

> Whenever a threat is detected and blocked, our [software] insert the blocked IP address in a list. This list was then sequentially checked every time a request was evaluated by an agent, thus increasing the latency of our customer’s application. This was a O(n) operation

Right, any database where you just go ALTER TABLE blocklist ADD INDEX ('ip'); is faster than that homebrew O(n) solution. No need for another homebrew solution.

Ended up reading the whole post anyway because now I'm curious if this radix tree might be better than a tree a database would use, but they never compared the performance between any old index and their radix tree.

The CPU time and memory bandwidth needed to search a 5000-entry radix tree does not leave you with a lot of budget to beat it by making any sort of network query to a database, not even one on the same machine. Even in the worst case the radix tree is likely to be done with its lookup before you even get memory allocated to start building the query, you might just barely win if the radix tree has to go all the way down and do all of its memory accesses, but you'll still lose overall. Advantageous early bailouts also mean that the radix tree is often done after just three or four pointer hops, with all of those three or four lookups having a decent chance of being in L1 and almost certain to be in L2, if this is the primary thing the CPU is doing on that machine.

and that is basically what happens in database software when the query executed 2nd and all the other following times. The mildly interesting difference is just radix tree vs. various [usually highly optimized] types of indexes and index/search optimized storage types available in a DB engine.

While correctly implemented embedded radix tree would probably be faster on smallish/medium datasets, the gain may be not worth the costs associated with basically implementing your own DB engine when it comes to datasets which wouldn't in its entirety normally sit in memory all the time and/or when complexity of the queries to run against a dataset makes you to implement your own full blown query language/API and/or when updates of the dataset by downloading/reloading the whole dataset become infeasible and you need to implement your own partial update machinery.

Yes, even if I assume the happy path here, once the prepared query is looked up on the client side, the parameters filled out, the packet constructed to send to the database, the OS kernel switched in to send the packet to the DB process, the DB switched in to select on it, the DB reads the packet, the DB parses the packet to determine what prepared query it's running, the DB finds the correct bytecode (or whatever) to run and passes it to an execution engine to run, yes, at that point the DB will make a highly optimized radix tree or similar lookup itself.

Additional non-"happy path" things the DB may encounter includes using a library or DB that doesn't support prepared queries so you're starting from scratch to construct a query, the DB being across any sort of actual network, and the DB library being thread-safe and having to deal with thread-safe primitives, any one of which is expensive enough to cover a radix tree's entire lookup operation, honestly, because even 50ns is likely to be longer than the entire radix operation.

My point is that the in-process radix tree is done before "the packet is constructed" and long before it can be sent. Depending on the details the radix tree might very well outrace "the prepared query is looked up on the client side" if that involves retrieving it from any sort of structure because you don't just have it in a variable, though under the circumstances assuming that it's just on hand is reasonable.

This is one of those things where you have to be thinking about those "order of magnitude" maps that are often posted about memory access operations, because computers span so many orders of magnitude that our intuition breaks down. Assuming L1 or L2 cache, the radix tree is made of almost entirely of single-digit-order nanosecondd instructions, and not even all that many of them, whereas reaching a database is going to involve microsecond-scale interactions. (To the extent that the radix tree may have to hit real RAM, well, so might we have to hit it for the DB interaction, and the DB is certainly bringing in a lot more code for the code cache, so even then, it's not a win for the DB-based code.)

i'm not arguing about the performance cost of network. Comparing apples to apples would be more about embedded db engines or a db engine on the same machine using a "local mode", not a network mode, driver.

Possibly! We didn't consider databases because this logic is implemented in agents injected in our users' applications. On top of limiting what we had access to, it also put a lot of pressure on us to minimise our CPU & memory overhead.

Because of that, we believed using a database (even in-memory) would have resulted in an unacceptable overhead.

Note that "database" does not necessarily mean "networked database server" to many people.

For example, using the Sqlite library counts as a database for most, and it runs in your process, in your memory. It is essentially a collection of functions for fast data structure manipulation.

I doubt that purpose build data structures optimizing L1/L2 cache, will be slower (or even similar to ) a database running on the same machine. Assuming, of course, that all that's needed can be fit into memory. Databases will do much better when you need to spill to disk.

There is a data structure called Judy arrays, invented by Doug Baskins. Judy array was used at some point in time in Erlang to back up its ets tables.

The paper (2003) comparing various data structures for that task is here

[1] http://erlang.org/workshop/2003/paper/p43-fritchie.pdf


For unsorted tables with populations of 70,000 keys or more, performance improvement by using the judyeh table type instead of set is easily measurable and significant.

This improvement can be seen with keys in sequential order and random order during operations involving table insertion, lookup, and counter updates.

The deletion operation is not as fast as set, but deletion’s extra cost is smaller than the benefit of insertion, lookup, and update operations combined. Furthermore, the additional RAM required by judyeh tables is quite modest.

The judyeh table type requires only about 6% more than set tables, which is smaller than the additional 15% required for the same data in an ordered set table.

The only operation this research examines which is significantly worse for judyeh tables than set tables is table traversal using the ets:next/2 function.


Would be interesting how Judy arrays are compared to radix

Yeah. Even a sqlite library would have been a MASSIVE improvement.

I would atleast use a hashmap rather than a linear scan for lookup. My guess is you would be utmost 10x faster in that case

If the list doesn't change that often, then sort the list and do a binary search for a worst case of O(log n) time.

Binary search is cache and branch-predictor inefficient, although certainly not as bad as linear scan (over large datasets).

If it is only for lookups and is static, why not do a dictionary with perfect hashing? You recalculate the hash each time you need to change the input.

(same as the postgres keyword lookup posted last week to HN using perfect hashing)

And save on some cache misses as well.

Absolutely! This would have been a problem when a IP range was blocked however. You would have to test 32 subnets for any given IPv4 (or 128 for IPv6!). We knew it was not ideal, but it had the least downsides at the time.

This is the question I had the entire time I was reading your post too. Doing set membership testing in constant time is already a solved problem, so why go through the trouble of radix trees when you could use any number of pre-built solutions?

I think it would have been a much more interesting and useful read if you had included the reasons that a radix tree is a better solution for your use case than hashing and how you implement range blocking.

I also wasn't able to understand the advantage of path compression or why being fixed-length is important.

We didn't spend too much time on the specific of our choice, because we chose to make something more widely usable (this is how radix trees work) than the specific product evolution we did. That being said, we probably were a bit light on details.

Going with an hash map was an option, but we feared the ballooning memory footprint when a large number of IPs is blocked, performance outliers (if you're unlucky enough for a lot of blocked IPs/ranges to end up in the same bucket) and we wanted the solution to work as well with IPv6 (where checking each bit takes a bit longer). As for using existing optimal constant time membership tests, we couldn't go with too complex algorithms because we would have to reimplement them and maintain them in each of our agent's language. Radix trees were attractive in this regard because this is a widely available data structure and when it's not, it's fairly easy to implement from scratch.

As for range blocking, it's actually very simple: we block the network IP and tag the node as "range". This way, when in the process of looking up an IP we hit a range node, we know we don't have to look any farther. This has the downside of preventing us from processing differently specific parts of the range but it's not a problem in our architecture.

Path compression lets you save on nodes, which is good for memory and make looking up an IP faster (because you have to go through fewer nodes before hitting a leaf or a divergence). Fixed length makes using bit mask to compare a value to the node simpler and take care of alignment issues, but we could do without it if we had to.

Thanks for the explanations, that makes a lot of sense. Is the memory footprint a concern for individual customers, or is this more about reducing the overall footprint across your combined customers?

I'm not sure what you're referring to. The radix tree is stored within the agent context which is running within our user's application. Because of this architecture, making the radix tree larger make the memory footprint of our user's application larger. Keeping it small was thus advantageous for all our customers, although only one ever had a problem due to them.

I didn't think memory would be a major concern for storing what sounds like less than 100kb of data on a server-side application.

If I understand the post correctly, the advantage for e.g. inserting a subnet is that most of the overall IP address value is shared between each IP in the subnet. In a Radix tree, this shared data will only be represented once in the three (since the entire path from root to leaf makes up the entry value). For a hash based solution, yes you get constant time insertion and constant time access, but you need a unique entry of the entire IP for each individual entry. This increases overall memory requirements.

If query time became a bottleneck, would you consider putting a Bloom Filter in front of the Radix tree? This would have low memory usage and a low false positive rate. Any positive hits would then do an exact check on the radix tree. So, for say 99% of non-blacklisted IPs, you would get constant time blacklist checks.

The problem they were solving had to do with blocking 10,000 IP addresses. This would require maybe 100kb in a hash table, so memory requirements probably aren't the issue.

Wouldn't the benefit of path compression mean reducing the memory footprint of the data structure?

And regarding fixed-length: "This lets us optimize the lookup code." More detail here would be cool but it wasn't really the point of the article.

My guess is that the path compression speeds up lookups by increasing data locality and reducing the number of fetches needed. Memory footprint probably isn't a big concern for storing 10,000 entries.

Path compression just means less pointer dereferences — fewer cache misses and branch mispredictions.

Try adding following optimizations to your binary radix tree code:

1) Path compression: avoid having intermediate empty nodes, when possible. E.g. replace root-left(void)-left(void)-left(void)-left(item) with root-[000]left(item). Because of the node reduction, you'll get better data cache usage.

2) Use a memory pool for the nodes: you'll save because of avoiding malloc() overhead, plus the possibility of using e.g. 32-bit indexes instead of pointers. Like in (1), this would help you reduce the memory usage, being able to put more nodes in the data cache.

3) Nth level lookup table, so you can jump e.g. instead of starting from the root node, you could go directly to the Nth level. For a minor cost in the insertion, for updating the LUT, you could get an important speed-up (2x-3x, more, or less, depending on your data and how you tune the LUT -fixed level LUT, adaptative-level LUT, etc.-).

Radix Trees are used extensively at Akamai. They are very useful in dealing with blocks of network addresses. When I was there, they had 2 libraries which shared the on disk format. One was mutable and had all sorts of operations you could do on them. The other was immutable and optimized for very fast lookups.

Radix trees are great, they are used extensively in GUN and give us incredible storage and routing performance even in decentralized networks: https://github.com/amark/gun/blob/master/lib/radisk.js#L87

Can someone summarize the tradeoffs between radix trees and (various kinds of) B-tree? (I.e., LSM-trees, B-epsilon trees, classic B(+)-trees.)

For something like this application, which seems heavily read-biased, it seems like a classic B-tree would be great. Or even a bloom filter, although the tradeoff there for false positives is maybe not suitable.

Most search trees only need a total ordering on the elements. A radix tree has stricter requirements: the data can be split in parts (e.g. x in x[0]..x[n] and y in y[0]..y[n]) such that each part has a total order and the original order is the prefix order of the parts.

Now you can already save a small constant by putting at each radix level a B-tree again. Then you won't compare x[0] to y[0] again at level (1...n) when you know they are equal (for example a database of all sentences starting with 'The')

True radix trees go further: when every part is finite with a small continuous domain (say 256 elements) you can build a jump table (1 index instead of 8 comparisons and 1 index). Additionally: no balancing needed, less memory needed than a B-tree.

Problems with radix trees: normally you don't store the elements, only rebuild them so no referencing. Extra constraints on your datatype result in higher coupling.

Wish everyone knew this. I still work with vendors who do this sequentially. My company has a consecutive 64 block of IPs and many vendors insist it is too many to whitelist.

Fascinating use of a data structure... but why are they blocking by ip address is the better question.

I am appalled that the understanding of basic complexity theory is missing in a company like this, and that people apparently think this is worth reading.

This SHOULD NOT be news.

Maybe you guys should revisit your school books.

No, this comment is not "too harsh" or elitistic.

No one is forcing you to read it... I'm a self taught developer and I lack a lot of low level data structure/algorithms knowledge. I really enjoyed reading the article myself and learned something from it. Sorry that you didn't enjoy it.

Maybe programmers should educate themselves in the basics before working on products people depend on? Customers PAY that company for their services.

Would you be ok if the people who worked on your house didn't know about construction basics -- and worse, created stuff that's widely known to be ineffective?

So you've just written off internship, apprenticeship, or the possibility of any on-the-job training, because? Everyone must somehow become a master craftsman without any in-market work experience before they start working? That's absurd.

> Customers PAY that company for their services.

If the impact of poor programming is material, customers will send their money elsewhere. Customers evaluate products in a marketplace; unless the sale explicitly mentioned the developers' qualifications, it's not like they are defrauding customers.

No, he doesn't. Any of those alternatives you mention (apprenticeship, etc.) require someone senior to you overseeing your tasks/progress. That someone can be a colleague, boss, CS teacher, etc. It's up to the entity overseeing your development (be it a school, or a company) to provide such expertise.

Once again, it's OK that you (person, programmer) don't know that. But as a company, there should be a strong guarantee that you (company) have the knowledge to build a good product. It's fine for a company to have some people who don't know this stuff, but not to not have anyone capable (and in a position that allows them to) notice these kind of mistakes.

The parent comment is talking about individuals, not companies:

> Maybe programmers should educate themselves in the basics before working on products people depend on?

That remark clearly puts the onus on individual workers and there is no qualifying statement that anyone who doesn't know this particular thing should be allowed to work, under supervision.

>That remark clearly puts the onus on individual workers and there is no qualifying statement that anyone who doesn't know this particular thing should be allowed to work, under supervision.

Ever heard of the principle of charity? Programmers should educate themselves, and that includes senior supervision. And they should make sure they used what's best for the job when the ship while still learning -- and if they're unsure, they could check with someone...

The sarcastic question about my reading being uncharitable is amusingly disconnected from your original, extremely uncharitable comment about the OP.

>So you've just written off internship, apprenticeship, or the possibility of any on-the-job training, because?

Nope, didn't wrote off any of those. In fact I encouraged them, saying programmers should educate themselves (and all of those are ways to do that).

What I did write off was letting selling ineffective staff by interns, apprentices, on-the-job trainees ship to customers...

Do customers pay for a quality product or to have whatever done by cheap or unpaid internets shipped to them?

I'm also appalled by why they even need to recode a radix tree themselves to filter out a DDOS, considering the many options available without reimplementing the wheel.

any serious hoster filter out ddos with a router. routers are basically build to test IP against subnets, so they can basically run each IP address against a million of hardware subnet testers all running in parallel and decide what to do depending on the result.

And if you don't have a router, any OS worth its salt already have a radix tree implementation for its routing table. The only thing you need to turn a Linux routing table into a ddos filter is to enable reverse path filtering and then add blackhole routes with iproute2.

The first part of your comment might be true, but regarding the newsworthiness of the post: there is always someone reading this from today's lucky 10000 [0]. It's generally great if complexity theory gets more coverage, here and on company's blog posts. Even if you know this inside out, your colleague might just hear it for the first time, or they might just want to read it up again because they didn't pay attention in class. Which is good for you too. Also, tomorrow you might be one of those lucky 10k.

[0] https://xkcd.com/1053/

I think it's great to put out an article like this, for the reader. But it's terribly embarrassing for the company.

Well, this is one reason why some companies insist on grilling candidates on "academic stuff".

Guess what, sometimes it's needed. Even knowing that a given data structure exists is sometimes enough.

I don't know why you're being downvoted, but I agree with your assessment. Your tone is perhaps a bit harsh, but this is precisely the reason large companies grill candidates on basic CS knowledge because like it or not it forms a fundamental base for you to build stuff on.

I'm more disappointed than appalled.

It's 5000 times faster than an sequential list lookup.

It is a good discussion of radix trees, though.

This indeed should not be news. It's OK for you as a developer (even a professional one) to not know about this. You can learn it from your colleagues, debugging poor performance situations like this, etc.

But it's not OK for a company to offer a service built in such a naive way (naive from a comp-sci point of view). To me this kind of mistakes are a symptom of having a poor CTO, poor hiring/dev practices, etc.

get over yourself bub

My education is equivalent to sophomore in CS with a couple years professional experience tacked on.

I'm pretty disappointed in your explanation of a radix tree, since I'd only had a very cursory exposure to one before. I don't think it's a good or even decent explanation. The point you try to make about how the tree only stores the end node is pretty incomprehensible when you are coming from a single digit BST to this. You should have sorted something that at least made sense to sort with a radix tree in your example.

Please do not engage in IP blocking, this is one of the most evil things that you can do on a site, along with displaying a white page when the user has JS disabled and using Google reCAPTCHA.

That's easy to say in general.

However, in the case of mitigation of a Denial of Service attack, using IP blocking is an effective temporary measure.

From my experience in dealing with HTTP DDoS-style attacks there's usually a pretty consistent block of IP addresses where the attack originates from, i.e. an AWS or Hetzner IP range. When the choices are to block IPs temporarily or have the entire service go down due to resource exhaustion, the choice is pretty clear cut.

We'd implement the block to keep our service up then report attacker IPs to AWS or whoever maintained the range in question. As the attack waned, we'd re-enable the range.

Of course, IP blocks usually did not work as well with udp-based reflection style DDoS mitigation, but we'd still see certain IP ranges responsible for significant portion of the attack.

We ran our own hardware though, so resources like CPU and inbound bandwidth weren't elastic enough to outpace an escalating DDoS past 120-140 Gbps. We also eventually moved to adding DDoS mitigation through third parties like CloudFlare.

I assume the IPs are only blocked after a known attack and not indefinitely, not based on blocklists from third parties or indefinitely (think spamhaus, which is rather beneficial to a few big players). Of course banning users without JS and using Google's CAPTCHAs is a big pain in the ass, but if someone requests my /phpmyadmin/setup.php out of the blue (never saw the IP before, not a known person who is just curious about my site), I think it's fair game to block the IP. I don't because I'd much rather know what other attacks they are using, but I do not think it's unethical to do so, given of course temporarily and locality.

As long as the blocks are temporary, and of course there is a good reason for the block, I actually think it makes a lot of sense.

I agree with you, sadly it seems that most that do ban IPs tend to do it permanently and without a good reason (I do not think that banning all the IPs of all dedicated servers and tor nodes in the world has a good reason)

if you keep banning IPs then you eventually get to a point where its more efficient, and cost effective to just yank the cable and, ignore the internet.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact