Over the years I've worked on a few applications that needed to model time intervals that are disjoint. For example, there's some kind of equipment available and time slots are booked out. For a data structure to represent this, you generally need to be able to insert, remove, and perform range queries (i.e., looking for the next availability).
In the past I've represented these with some kind of ordered map or tree. In Rust this might look something like `BTreeMap<u32, 32>` with start being the key and the end being the value. This works really well in practice, but it's not as fast I'd like for some real-time applications for really read-heavy range queries with thousands of intervals.
A lot of the specialized interval libraries I've found don't seem to perform better than a plain ordered map like I mentioned (many use an ordered map internally anyway). Many of these also weren't designed with cache-efficiency or SIMD in mind and spend a lot of time on tree (pointer) traversal.
I spent some time prototyping with some other data structures that might fit (adaptive radix trees to take advantage of the small integer key, dense bitsets for small ranges, applying a spatial index in 1-dimension, fixed grid hierarchies) but haven't been able to find anything much faster than an ordered map.
I was wondering if anyone has come across any interesting data structures that handle this well. Approaches that use a combination of multiple data structures to accelerate reads at the cost of slightly slower inserts/removals could be interesting.
> The slow cases usually tend to be trying to find wider slots and skipping through many smaller slots. There are often a lot of bookings up to the first wide enough slot too though, so it's a little bit of both.
This right here is the first critical issue. It doesn't matter how fast the data structure is if you're then going to do a linear search over the available slots to find the next one wide enough.
A good data structure for doing this efficiently is a range tree (https://en.wikipedia.org/wiki/Range_tree), where in each node you store the maximum width of the available slots covered by the node. That lets you find the first available slot after some time t and wider than some duration w in O(log(n)), where n is the number of available slots. (It might not be obvious if you're not familiar with range trees; I'm happy to provide more details.)
For the implementation, there are a few options:
A. The simplest is to use a complete range tree, where the leaf nodes are all the possible times. You can lazily create nodes to avoid massive memory usage for sparse ranges. The advantage is that you don't need to do any balancing; the disadvantage is that the time complexity is O(long(T)) where T is the total number of possible times; so it's going to be a little slower on very sparse datasets.
B. The O(log(n)) implementation that's being called for is a self-balancing binary tree (e.g. red-black), modified to also store the maximums in each node. Unfortunately most libraries don't give you low-level control over the tree's nodes, so you'd likely need to copy the code and modify it (or implement the self-balancing tree from scratch).
C. If those are still too slow (and you're certain that your implementation is really O(log(n))), you'll need to improve the cache efficiency. That basically comes down to using larger nodes. The obvious approach is to switch to a B-tree; but you could also keep your nodes binary and just change the way they are allocated to emulate the memory layout of a B-tree (this is simpler but slower because it still uses lots of pointers). Another idea is to replace the first few layers of the tree with a hash table (or a simple array if your data is dense enough). Likewise you can replace the leaf nodes with small arrays.