Ask HN: What are some cool but obscure data structures you know about?
2051 points by Uptrenda 17 days ago | 753 comments
I'm very interested in what types of interesting data structures are out there, HN. Totally your preference.

I'll start: bloom filters. They let you test if a value is definitely NOT in a set of pre-stored values (or POSSIBLY in the set - with an adjustable false-positive probability that trades off against the storage used for the values.)

Good use-case: routing. Say you have a list of 1 million IPs that are blacklisted. A trivial algorithm would be to compare every element of the set with a given IP. The time complexity grows with the number of elements. Not so with a bloom filter! A bloom filter is one of the few data structures whose time complexity does not grow with the number of elements, because the 'keys' don't need to be stored ('search' and 'insert' cost depends only on the number of hash functions.)
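A toy sketch of the idea in JavaScript (the bit-array size, the number of hashes, and the FNV-style hash below are illustrative, not tuned for real workloads):

```javascript
// Toy bloom filter: a fixed-size bit array plus k seeded hash functions.
class BloomFilter {
  constructor(bits = 1024, k = 3) {
    this.bits = bits;
    this.k = k;
    this.array = new Uint8Array(Math.ceil(bits / 8));
  }
  // Simple seeded string hash (FNV-1a variant); fine for a demo only.
  hash(value, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.bits;
  }
  add(value) {
    for (let s = 0; s < this.k; s++) {
      const bit = this.hash(value, s);
      this.array[bit >> 3] |= 1 << (bit & 7); // set bit
    }
  }
  mightContain(value) {
    for (let s = 0; s < this.k; s++) {
      const bit = this.hash(value, s);
      if (!(this.array[bit >> 3] & (1 << (bit & 7)))) return false; // definitely not present
    }
    return true; // possibly present (false positives allowed)
  }
}
```

A real deployment would size `bits` and `k` from the expected element count and the target false-positive rate.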

Bonus section: Golomb Coded Sets are similar to bloom filters but the storage space is much smaller. Worse performance though.

Not a very deep CS-y one, but still one of my favourite data structures: Promise Maps.

It only works in languages where promises/futures/tasks are a first-class citizen. Eg JavaScript.

When caching the result of an expensive computation or a network call, don't actually cache the result, but cache the promise that awaits the result. Ie don't make a

    Map<Key, Result>
but a

    Map<Key, Promise<Result>>
This way, if a new, uncached key gets requested twice in rapid succession, ie faster than the computation takes, you avoid computing/fetching the same value twice. This trick works because:

- promises hold on to their result value indefinitely (until they're GC'ed)

- you can await (or .then()) an existing promise as many times as you want

- awaiting an already-resolved promise is a very low-overhead operation.

In other words, the promise acts as a mutex around the computation, and the resulting code is understandable even by people unfamiliar with mutexes, locks and so on.
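A minimal sketch of the pattern (`getCached` and `slowDouble` are illustrative names, not from any library):

```javascript
// Promise map: cache the in-flight promise, not the eventual result.
const cache = new Map();

function getCached(key, compute) {
  let promise = cache.get(key);
  if (!promise) {
    // Store the promise immediately, so concurrent callers for the same key
    // share one computation instead of each starting their own.
    promise = Promise.resolve().then(() => compute(key));
    cache.set(key, promise);
  }
  return promise;
}

// Usage: two rapid requests for the same uncached key trigger one computation.
let calls = 0;
const slowDouble = async (k) => { calls += 1; return k * 2; };
Promise.all([getCached(7, slowDouble), getCached(7, slowDouble)])
  .then(([a, b]) => console.log(a, b, calls)); // logs: 14 14 1
```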

FYI, if you plan to do this in ASP netcore, combine with AsyncLazy for the most optimal results https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/b...

Neat page from David Fowler. I didn't know about that one. Thank you.

To further explain why you want to use an AsyncLazy (or Lazy for synchronous programming) when working with ConcurrentDictionary I find this article to be helpful: https://andrewlock.net/making-getoradd-on-concurrentdictiona...

Be careful with this data structure. If the language allows async exceptions, or you have a bug where the promise never becomes determined, there are a lot of edge cases. Examples of edge cases:

- if the promise never becomes determined (eg bug, async exception) your app will wait forever

- if the promise has high tail latency things can be bad

- if the language eagerly binds on determined promises (ie it doesn’t schedule your .then function) you can get weird semantic bugs.

- changing the keys of the table incrementally instead of just using for memoisation can be really messy

I have a couple pieces of code where we had to add rate limiting because it was just too easy to make too many async calls all at once, and things only got 'worse' as I fixed performance issues in the traversal code that were creating an ersatz backpressure situation.

Philosophically, the main problem is that promise caches are a primary enabler of the whole cache anti-pattern situation. People make the mistake of thinking that 'dynamic programming' means memoization and 'memoization' means caching. "Everything you said is wrong." Trying to explain this to people has been a recurring challenge, because it's such a common, shallow but strongly-held misunderstanding.

Youtube recently suggested this video to me, which does a pretty good job of explaining beginning and intermediate DP:


What I love most about this video is that it introduces memoization about 10 minutes in, but by 20% of the way through it has already abandoned it for something better: tabulation. Tabulation is an iterative traversal of the problem that gathers the important data not only in one place but within a single stack frame. It telegraphs an information architecture, while recursive calls represent emergent behavior. The reliance on cache to function not only obscures the information architecture, it does so by introducing global shared state. Global shared state and emergent behavior are both poison to the long-term health of a project, and here we have a single data structure that represents both in spades.

We are supposed to be fighting accidental coupling, and especially global variables, not engaging in word games to hide what we are doing.

So I think my answer to OP's question is 'tabulation', which is part data structure and part algorithm.

I wrote some Python code recently that uses a similar data structure (Futures instead of Promises, without knowing necessarily about the data structure's use in JavaScript). It wasn't really for caching.

I modeled mine on the Django ASGI reference library's server implementation, which uses the data structure for maintaining references to stateful event consumers. Exception handling is done with a pre-scheduled long-running coroutine that looks at the map.

I'm curious about your second point -- why exactly do things get bad with high tail latency? Is it only a weakness of the data structure when used for caching? I'm having trouble picturing that.

Suppose the call to evaluate has a p95 of 5 seconds (this is very large of course). If your first call to compute a value hits it, all the subsequent requests for that cell are blocked for 5s. If you didn’t try to cache, only one might block for 5s and the rest could go through fast. On the other hand if you do 20 requests then about one of them will get the p95 latency.

Best practice in my experience is to use a timeout on all async operations to handle edge cases 1 & 2 above. The third case isn't possible with JavaScript AFAIK.
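A hedged sketch of that best practice (`withTimeout` is an illustrative name): race the cached promise against a timer so an operation that never settles can't wedge every caller sharing the cached promise.

```javascript
// Reject if `promise` hasn't settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timed out")), ms);
  });
  // Clear the timer either way so it doesn't keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

The idea would be to cache `withTimeout(compute(key), limit)` rather than the bare promise, so edge cases 1 and 2 resolve into ordinary (evictable) failures.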

- if the language eagerly binds on determined promises (ie it doesn’t schedule your .then function) you can get weird semantic bugs.

What would be an example of this? If the promise has been determined, why not just immediately run the .then function?

The reason not to do it is to limit the non-determinism of the order of execution.

In Javascript

functions are always called in the relative order f1, f2, f3 and never f1, f3 , f2 even if p had already resolved.
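Concretely (a reconstruction of the idea, assuming three `.then` registrations on one already-resolved promise `p`):

```javascript
const p = Promise.resolve("done");
const order = [];
p.then(() => order.push("f1"));
p.then(() => order.push("f2"));
p.then(() => order.push("f3"));
// .then callbacks never run synchronously; they are queued as microtasks in
// registration order, so order ends up ["f1", "f2", "f3"] even though p was
// already resolved when each .then was called.
```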

I know this as memoization with lazy evaluation, nothing new, but at the same time very useful, and I would argue, it is very CS.

Not really quite lazy evaluation, though, at least in JavaScript. Promises begin execution as soon as they are created and there is no way to delay that.

Promises are an implementation of lazy evaluation for Javascript. This is exactly lazy evaluation.

By the way, lazy evaluation is the one that offers no guarantees about when your code will be executed. If you can delay the execution, it's not lazy.

Wait. I thought lazy evaluation is defined as evaluation at the time a value is needed. After that I think it will then be in Weak Head Normal Form, which can be thought of as "evaluated at its top level"... but I'm a bit rusty. Basically, an expression gets used somewhere (e.g. pattern matched, in the ADT sense). If it's in WHNF, cool, it's evaluated already (subexpressions within may not yet, but that's their problem--that's exactly what WHNF says). If not, we evaluate it down to WHNF.

Wikipedia states it as: "When using delayed evaluation, an expression is not evaluated as soon as it gets bound to a variable, but when the evaluator is forced to produce the expression's value."

So if evaluation begins effectively at the time the promise is issued, not at the time the promise is awaited, then I would not call that lazy evaluation, and what hn_throwaway_99 said sounds correct to me.

Is my rusty old PL understanding wrong?

> Wait. I thought lazy evaluation is defined as evaluation at the time a value is needed.


You've got it right, GP comment has it wrong.

Lazy does not simply mean "evaluated later". From what I understand, in JS, if you call an async function without `await`, its body starts executing immediately, synchronously up to the first `await`; the continuation is then queued as a microtask and runs once the calling stack completes.

A queue of functions to execute does not constitute "lazy evaluation" on its own.

Definition on Wikipedia says otherwise:

“In programming language theory, lazy evaluation, or call-by-need,[1] is an evaluation strategy which delays the evaluation of an expression until its value is needed (non-strict evaluation) and which also avoids repeated evaluations (sharing).”

Haskell would be an apt example here.

Promises in javascript are absolutely not considered 'an implementation of lazy evaluation.' Promises are invoked eagerly in javascript.

I've always called it request piggybacking (atop memoization). Our db abstraction at my last gig absolutely required it for scaling purposes. Great tool for the tool belt.

Can someone explain to me why the second example is better.

To me it seems to be the same thing. Replace result with int and I literally do not see a problem with the first one.

Also why is a mutex or lock needed for Result in javascript? As far as I know... In a single threaded application, mutexes and locks are only needed for memory operations on two or more results.

With a single value, say an int, within javascript you are completely safe in terms of concurrent access. Only 2 or more variables can cause a race condition. Or am I wrong here?


Thanks for the replies. I see now. The mutex concept is not around memory access or memory safety. It's solely around the computation itself. When the Promise is "pending". It has nothing to do with safety. The parent did mention this, but I completely glossed over it.

You can put the promise into the cache immediately but you can only put the result from the promise into the cache once the promise resolves. So if an identical request comes in a second time before the promise has been resolved, then if you are caching the promise you have a cache hit but if you are caching the result then you have a cache miss and you end up doing the work twice.

I am still not understanding the purpose of this as I believe it is grounded on the wrong assumption.

Pretty much every single asynchronous operation other than some `Promise.resolve(foo)` where foo is a static value can fail. Reading from the file system, calling an api, connecting to some database, etc.

If the original promise fails you're gonna return a cached failure.

Mind you, I'm not stating this is completely useless, but at the end of the day you will be forced to add logic to check all of those cached asynchronous computation results for failure, which will eventually outweigh the benefit over only caching the resolved data.

> If the original promise fails you're gonna return a cached failure.

This is usually a good thing, and even in cases where it isn't, it's often a worthwhile trade-off.

In the common case a failure will either be persistent, or - if load/traffic related - will benefit from a reduction in requests (waiting a while to try again). In both of these cases, where your first key request fails, you want the immediately-following cases to fail fast: "caching" the promise caches the failure for the duration of one request (presumably the cache is emptied once the promise resolves, allowing subsequent key accesses to retry).

The less common case where the above isn't true is where you have very unstable key access (frequent one-off failures). In those cases you might want a cache miss on the second key request, but successful key retrieval usually isn't as critical in such systems which makes the trade off very worthwhile.

> If the original promise fails you're gonna return a cached failure.

In the scenario where that's an issue, you would need to add (extremely trivial 3-5 lines) logic to handle retrying a cached failure. The underlying data structure would continue to be a promise map.
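A hedged sketch of those few lines (`getCachedWithRetry` is an illustrative name): evict a rejected promise from the cache so the next caller recomputes instead of receiving the cached failure.

```javascript
const cache = new Map();

function getCachedWithRetry(key, compute) {
  let promise = cache.get(key);
  if (!promise) {
    promise = compute(key);
    promise.catch(() => cache.delete(key)); // sad path: drop the entry so a later call retries
    cache.set(key, promise);
  }
  return promise;
}
```

The underlying structure is still a promise map; only the failure path changes.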

> If the original promise fails you're gonna return a cached failure.

In many cases, another near-in-time request would also fail, so returning a cached failure rather than failing separately is probably a good idea (if you need retry logic, you do it within the promise, and you still need only a single instance.)

(If you are in a system with async and parallel computation both available, you can also use this for expensive to compute pure functions of the key.)

You do one IO operation instead of two.

It's unlikely that always doing two is going to be more successful than just trying once and dealing with failure

It's a tiny optimization.

When the VERY FIRST ASYNC operation is inflight the cache is immediately loaded with a Promise, which blocks all other calls while the FIRST async operation is in flight. This is only relevant to the very FIRST call. That's it. After that the promise is pointless.

As for the Promise failure you can just think of that as equivalent of the value not existing in the cache. The logic should interpret the promise this way.

It's not always a tiny optimization. If you have an expensive query or operation, this prevents potentially many duplicative calls.

A practical example of this was an analytics dashboard I was working on years ago -- the UI would trigger a few waves of requests as parts of it loaded (batching was used, but would not be used across the entire page). It was likely that for a given load, there would be four or more of these requests in-flight at once. Each request needed the result of an expensive (~3s+) query and computation. Promise caching allowed operations past the first to trivially reuse the cached promise.

There are certainly other approaches that can be taken, but this works very well as a mostly-drop-in replacement for a traditional cache that shouldn't cause much of a ripple to the codebase.

Yep, I've been bitten by exactly this failure mode.

You have to invalidate/bust the cache when a failure is returned (which is racy, but since cache busting is on the sad path it's a fine place to protect with a plain ol' mutex, assuming you even have real parallelism/preemption in the mix).

Alternatively you can cache promises to functions that will never return failure and will instead internally retry until they succeed. This approach generalizes less well to arbitrary promises, but is more friendly to implementing the custom back off/retry scheme of your choice.

It's not the cost of saving the resolved data.

If I understand the pattern correctly, it is to avoid multiple asynchronous requests to a resource that has yet to be cached.

Yeah, that's my understanding too.

It seems like an optimization to prevent lots of downstream requests that occur in rapid succession before the first request would have finished. I'd also suspect that the entry would be removed from the map on failure.

I believe the idea is that if you stored the result instead, you would have to wait for the promise to resolve the first time. If you made two requests in quick succession, the second request wouldn't see anything in the result map for the first one (as it hasn't resolved yet) and would then need to make its own new request.

If you store the promise, then not only does the second attempt know that it doesn't need to make a new request, but it can also await on the promise it finds in the map directly.

Wait a minute.

Under Javascript, the behavior induced by this pattern is EXACTLY equivalent to that of CACHING the result directly and a BLOCKING call. The pattern is completely pointless if you view it from that angle. Might as well use blocking calls and a regular cache rather than async calls and cached promises.

So then the benefit of this pattern is essentially it's a way for you to turn your async functions into cached non-async functions.

Is this general intuition correct?

--edit: It's not correct. Different async functions will still be in parallel. It's only all blocking and serial under the multiple calls to the SAME function.

Say you have three elements in your UI that need to fetch some user info. You have two major options:

1/ Have a higher level source of truth that will fetch it once from your repository (how does it know it needs to fetch? How is cache handled ?) and distribute it to those three elements. Complex, makes components more independent but also more dumb. It's fine to have pure elements, but sometimes you just want to write <UserInfoHeader/> and let it handle its stuff.

2/ Your repository keeps this Map<Key, Promise<T>>, and every time you call getUserInfo(), it checks the map at key "userinfo" and either return the promise (which might be ongoing, or already resolved) or see that it's not there and do the call, writing the promise back into the map. This way, your three components can just call getUserInfo() without giving a damn about any other ones. The first one that calls it pre-resolves it for others.

As to why a promise instead of just the raw result: one can potentially return null (and you need to call again later to refresh, or straight up blocks during the entire call), the other one just gives you back promises and you can just listen to them and update your UI whenever it's ready (which might be in 5 seconds, or right now because the promise has already been resolved)

It's a bad implementation of a cached repository (because it ignores TTL and invalidation as well as many problems that need to be handled) that any junior developer could figure out (so it's anything but obscure), but sometimes, eh, you don't need much more.

Because the computation of Result is _expensive_ and _async_.

So if you get two requests for the computation that computes Result in close succession the Map doesn't have the result in place for the second call and as a result you get no caching effect and make 2 expensive requests.

By caching the Promise instead the second request is caught by the cache and simply awaits the Promise resolving so now you only get 1 expensive request.

I didn't know this was a formal design pattern; I do recall having implemented this myself once, as a means to avoid double requests from different components.

Later on I switched to using react-query which has something like this built-in, it's been really good.

This is only tangentially related to:

> you avoid computing/fetching the same value twice

But this problem comes up in reverse proxies too. In Nginx, you can set the `proxy_cache_lock`[0] directive to achieve the same effect (avoiding double requests).

[0]: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#pr...

That right there is Cache Stampede[0] prevention.

[0]: https://en.wikipedia.org/wiki/Cache_stampede

Interesting! This is an issue I had with an external API which I intended to cache on my serverless workers infra.

I hit the API's rate limiter when the workers invalidated their cache, even though I staggered the lifetime of the cache keys for each replicated instance. This is how I found out how many Cloudflare workers run on a single edge instance. Hint: It's many.

Didn't know it had a name. I'm delighted, thanks!

You’re welcome, happy to help. If you are in the .NET space I suggest you to take a look at FusionCache. It has cache stampede protection built in, plus some other nice features like a fail-safe mechanism and soft/hard timeouts https://github.com/jodydonetti/ZiggyCreatures.FusionCache

I love this approach and have used it many times in JavaScript. I often end up adding an additional map in front with the resolved values to check first, because awaiting or then'ing a Promise always means you will wait until the next microtask for the value, instead of getting it immediately. With a framework like React, this means you'll have a flash of missing content even when it is already cached.

Replacing the promise with the result in the map (using a sum type as the map value type) might be more efficient than maintaining two maps?
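A hedged sketch of that sum-type idea (names are illustrative): a single map whose entries are either a settled value or a still-pending promise, so resolved entries can be read synchronously with no microtask hop.

```javascript
// key -> { done: true, value } | { done: false, promise }
const cache = new Map();

function lookup(key, compute) {
  const existing = cache.get(key);
  if (existing) return existing; // if existing.done, the value is usable synchronously
  const promise = compute(key).then((value) => {
    cache.set(key, { done: true, value }); // swap the promise for the settled value
    return value;
  });
  const pending = { done: false, promise };
  cache.set(key, pending);
  return pending;
}
```

A UI caller would branch on `entry.done` to render immediately when the value is already there, avoiding the flash of missing content.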

This surprises me. I had expected the microtask to be executed right after the current one, ie before any layouting etc - isn't that the whole point of the "micro" aspect?

I was able to reproduce this in a simple example [1]. If you refresh it a few times you will be able to see it flash (at least I did in Chrome on Mac). It is probably more noticeable if you set it up as an SPA with a page transition, but I wanted to keep the example simple.

[1] https://codesandbox.io/s/recursing-glitter-h4c83u?file=/src/...

I am deeply disturbed and confused. Do you understand why this happens?

The event loop shouldn't continue the circle until the microtask queue is empty, from what I remember.

Browsers probably used to squeeze layouting and such things in when the event loop completed an iteration or became idle. However nowadays it's entirely possible that we have, or will have, browsers that do even work in parallel while the javascript event loop is still running, possibly even beginning to paint while javascript is still running synchronously (haven't seen that yet).

I've seen instances where browsers seem to be like "nah, I'll render this part later" and only would render some parts of the page, but certain layers (scrolling containers) would not be filled on the first paint - despite everything being changed in one synchronous operation!* And I've seen them do that at least 5 years ago.

It seems to me the situation is complicated.

* There was a loading overlay and a list being filled with data (possibly thousands of elements). The loading overlay would be removed after the list was filled, revealing it. What often happened was that the overlay would disappear, but the list still looking empty for a frame or two.

Yeah, saves you from thundering herd problems

https://github.com/ben-manes/caffeine cache does something similar by caching the future that gets returned on look-ups if the value is still being computed

I have a similar pattern when using an rxjs pipeline to do async tasks in response to changes in other values. The typical pattern would be

    readonly foo = bar.pipe(
        switchMap(v => doTask(v)),
        shareReplay(1),
    );
Where doTask returns a promise or an AsyncSubject. The shareReplay allows foo to cache the last value so any late subscriber will get the most recent value. This has a problem of returning cached values while new work is happening (eg while a network request is in progress) so instead I like to write

    readonly foo = bar.pipe(
        map(v => doTask(v)),
        shareReplay(1),
        switchMap(v => v),
    );
This way the shareReplay is caching the promise instead of the result. With this pattern I can interact with this pipeline from code like so

    async someEventHandler(userInput) {
        const result = await firstValueFrom(this.foo);
    }
Without the pattern, this code would get the old result if there was ever any cached value

It’s also the clearest and least buggy way to iterate over the results: map to promises, then `await Promise.all(promises)`.

Partly off-topic, but in JS I tend to avoid iterating over promises with maps because it's always confusing what will/won't work, so I use a basic FOR loop because async/await works in there.

How bad is this? Should I switch to Promise.all instead?

Iterating with async/await means you wait for every result before making another call. You basically execute the async calls one by one.

Promise.all runs them all simultaneously and waits until they are all complete. It returns an array of results in the same order as the calls, so it's usually pretty straightforward to use.

Both approaches have a purpose, so it's not like you "should" strictly use one. But you should be aware of both and be able to use the one that fits the need.

The third major strategy being Promise.any():

- as soon as any of them resolve (any)

- when all of them resolve (all)

- resolve each in order (for)
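A quick sketch of the sequential-vs-concurrent contrast (the `delay` helper and the 10ms figure are illustrative):

```javascript
const delay = (ms, v) => new Promise((resolve) => setTimeout(() => resolve(v), ms));

async function sequential(items) {
  const out = [];
  for (const item of items) {
    out.push(await delay(10, item * 2)); // the next call starts only after this one settles
  }
  return out;
}

async function concurrent(items) {
  // All delays start at once; Promise.all preserves input order in the results.
  return Promise.all(items.map((item) => delay(10, item * 2)));
}
```

Same results either way, but `sequential([1, 2, 3])` takes roughly three delays while `concurrent([1, 2, 3])` takes roughly one.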

Promise.all/allSettled/etc will allow you to resolve your promises concurrently rather than awaiting each result in the for loop. Depends what your use case is, really.

Use my approach or yours, but never use forEach with await. I don’t remember exactly why, but it just doesn’t work and will lead to very confusing errors.

Because Array.forEach doesn’t return a value or anything that could be awaited. If you use Array.map you can create an Array of promises that can be awaited with Promise.all
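A short demonstration of the pitfall:

```javascript
// forEach ignores the promise returned by an async callback, so nothing waits.
const results = [];
[1, 2, 3].forEach(async (n) => {
  await Promise.resolve(); // everything after this runs later, as a microtask
  results.push(n);
});
console.log(results.length); // 0: forEach already returned, the pushes happen later
```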

Just make sure Map access is thread safe - awaiting promises is - but updating/reading map keys usually isn't.

Yep, in multithreaded langs you just want to use a ConcurrentMap or whatever the stdlib has for this. The write is very brief because you only write the promise and not eg keep a proper lock on the key until the result has arrived, so no need for a fancier datastructure.

js is single-threaded so no problem there.

Since OP mentioned Mutexes I assumed he's dealing with multithreaded code, but with JS this works as advertised :)

This is clever, otherwise you have to coordinate your cache setter and your async function call between potentially many concurrent calls. There are ways to do this coordination though; one implementation I've borrowed in practice in Python is mode's stampede decorator: https://github.com/ask/mode/blob/master/mode/utils/futures.p...

I believe that any time you await/then you wind up going back to the event loop. It's cheap ish but definitely not free. This is, of course, to make the behavior more predictable, but in this case it might be undesirable. Unfortunately, I don't think there's any obvious way around it. It can sorta be done with vanilla callbacks, but that takes you out of the async/promises world.

I have written many tools that do something like this, very useful.

Just gotta be careful with memory leaks, so (for a TS example) you might want to do something like this:

    const promiseMap = new Map<string, Promise<any>>();

    async function keyedDebounce<R>(key: string, fn: () => Promise<R>): Promise<R> {
      const existingPromise = promiseMap.get(key);
      if (existingPromise) return existingPromise;

      const promise = (async () => {
        const result = await fn(); // ignore lack of error handling

        // avoid memory leak by cleaning up once the work is done
        promiseMap.delete(key);

        // this will resolve the .then'ed or awaited promises everywhere
        return result;
      })();

      promiseMap.set(key, promise);

      return promise;
    }
So that the promise sticks around for that key while the function is active, but then it clears it so that you're not just building up keys in a map.

This is a great solution to the thundering herd problem, but isn't OP explicitly trying to cache the results for later? In that case, the promise map is a clever way to make sure the cachable value is fetched at most once, and you want the keys to build up in the map, so GC'ing them is counterproductive.

Ya it obviously depends on the intended goal, but caching items in a map indefinitely better be done on a set of values that is known to be limited to a certain size over the lifetime of the server or it'll lead to a memory leak.

If the result of the async function is not necessarily the same value every time you call it, you may want to recompute it

Please correct me if I'm wrong. But you'd better avoid creating extra promises for performance reasons

  const promiseMap = new Map<string, Promise<any>>();

  async function keyedDebounce<R>(key: string, fn: () => Promise<R>): Promise<R> {
    const existingPromise = promiseMap.get(key);
    if (existingPromise) return existingPromise;
    const promise = fn().then(result => {
      promiseMap.delete(key); // same cleanup, without the extra wrapper promise
      return result;
    });
    promiseMap.set(key, promise);
    return promise;
  }

You check for if(promiseMap.get(key)) and in the NO case you do promiseMap.delete(key)?

Could you explain why that's necessary? (sorry probs a stupid question)

So if the promise still exists, that means that there is an active call out for a promise using this key already, so we'll just return it. We know this because when the function that is executing finishes, we delete the promise from the map.

A purpose of this might be, let's say you rebuild some structure very frequently, like hundreds of times per second or whatever. You can call this function for a given key as many times as you want while that async process is executing, and it will only execute it once, always returning the same promise.

In the no case, the promise is created, and then deleted after the function is resolved.

I’m not entirely sure of the use of this pattern where you’re using a map and deleting the keys as you go - seems like you’d end up doing more work kinda randomly depending on how the keys were added. I’d just release the whole map when I was done with it.

Won’t JS garbage collect orphaned references? Why is this necessary?

there is some confusion here.

OP is intentionally caching results of, presumably, expensive computations or network calls. It's a cache. There is no memory leak. OP just did not detail how cache invalidation/replacement happens. The 2nd comment adds a rather rudimentary mechanism of removing items from cache as soon as they are ready. You get the benefit of batching requests that occur before the resolve, but you get the downside of requests coming in right after the resolve hitting the expensive process again. Requests don't neatly line up at the door and stop coming in as soon as your database query returns.

Typically you would use an LRU mechanism. All caches (memcached, redis, etc.) have a memory limit (whether fixed or unbounded, i.e. all RAM) and a cache replacement policy.
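A minimal LRU sketch exploiting a `Map`'s insertion order (illustrative only, not a replacement for a real cache library):

```javascript
class LruCache {
  constructor(limit) {
    this.limit = limit;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key); // re-insert so this key becomes most recently used
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.limit) {
      // Evict the least recently used entry: the first key in insertion order.
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```

The values could just as well be promises, giving a promise map with a bounded memory footprint.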

It does, but the map has a reference to it, so it will "leak" (in gc languages an unwanted reference is considered a leak). If this map got rather large, you could end up with a rather large heap and it would be un-obvious why at first.

Do Promises hold a reference to the chain of functions that end in the result? If so, that seems like a bug.

A promise is just an object, it does not contain any reference to the chained functions.

Could you use a WeakMap for this instead?

If the key is a type that you expect to be GC'ed, yes, totally. Note that WeakMap keys must be objects, though: a simple value type eg a string will throw a TypeError, so for those you're back to a regular Map (which then holds its entries until you delete them).

Ah yeah, good point.

I've mostly used this overall pattern with something like Dataloader, where you throw away the entire cache on every request, so the GC problem doesn't crop up.


This might be relevant to this exact pattern

  // assuming a memoizing wrapper, eg lodash's memoize, which caches by the first argument
  const getThing = memoize(async function (arg1, arg2) {
    console.log({ arg1, arg2 });
    return arg1 + (arg2 || 0);
  });

  console.log(await getThing(1)); // New promise
  console.log(await getThing(2)); // New promise
  console.log(await getThing(1)); // Returns promise from earlier

>> Not a very deep CS-y one

yes, in fact this is more like a pattern than an actual data structure, since you might as well replace it with a list of futures. And a comparison between future based concurrency and thread based parallelism is like apples to oranges.

But it isn't surprising that this pattern ended up at the top since it is what the users of this site would be most familiar with.

Vitess does something similar, "query coalescing", which was designed for loads that are very read-heavy. If 1000 requests come in that end up making the same read query from the database, and they all came in in the same 50ms that it takes to get the results from the database, you can just return the same answer to all of them.

I wrote a version of this with Elixir: https://github.com/bschaeffer/til/tree/master/elixir/gen_ser...

Didn't know what to call it but PromiseMaps is nice name for it.

Promise Caching is a nicer name for it, since that's what it is if well-implemented.

> It only works in languages where promises/futures/tasks are a first-class citizen. Eg JavaScript.

Why would that be true?

I think the actual answer is that it only works in languages where you can eagerly execute async code disconnected from the Promise/Future construct. It would've worked in ES3 using a Promise polyfill. (Which is notable because it does not fit the aforementioned criteria—it had no real async support.)

It isn’t. The first implementation of promises I worked on was in Java. We used it as part of verifying certificate chains. If you use threads it doesn’t matter if the operations are blocking, as long as you move them off-thread.

Anybody knows if there is similar pattern for Python's asyncio?

How about a dict whose values are asyncio.Future?
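That works; here's a sketch using asyncio.Task (a Future subclass) as the dict values. The function and key names are made up for illustration.

```python
# Cache the Task (asyncio's future-like handle), not the result, so
# concurrent callers share one in-flight fetch.
import asyncio

cache: dict = {}
calls = 0

async def slow_fetch(key):
    global calls
    calls += 1                  # count how many real fetches happen
    await asyncio.sleep(0.01)   # stands in for a network call
    return key.upper()

def get(key):
    if key not in cache:
        # create_task starts the coroutine eagerly; awaiting the same
        # Task repeatedly just hands back the stored result.
        cache[key] = asyncio.create_task(slow_fetch(key))
    return cache[key]

async def main():
    # Two requests for the same key in rapid succession: only one fetch runs.
    a, b = await asyncio.gather(get("x"), get("x"))
    print(a, b, calls)

asyncio.run(main())
```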


I wonder if Swift's AsyncAwait could be used in such a way.

Sure, it's possible: we need to await the Task if it already exists in the dictionary. For example, we could write something like this inside an actor to make it thread-safe.

    private var tasksDictionary: [String: Task<Data, Error>] = [:]

    func getData(at urlString: String) async throws -> Data {
        if let currentTask = tasksDictionary[urlString] {
            return try await currentTask.value
        }
        let currentTask = Task {
            try await URLSession.shared.data(from: URL(string: urlString)!).0
        }
        tasksDictionary[urlString] = currentTask
        let result = try await currentTask.value
        tasksDictionary[urlString] = nil
        return result
    }

Hmm, why set `taskDictionary[urlString] = nil` at the bottom there? If the whole point is to cache the result, isn't the point to leave the value there for other requests to pick it up?

Yup, nice catch, no need to reset it to nil if you wanna keep it in memory indefinitely.

I guess setting it to nil can also be useful if you don't wanna make the same request when one is already in flight: in case you get a 403 error and need to request a refresh token, you don't wanna make two simultaneous requests, but you also don't wanna cache it indefinitely either.

Great! So much nicer than the Combine (or Rx) version.

This is great. I'm actually running into this problem but across distributed clients. Is there a distributed way to achieve somthing like this with say Redis?

Technically yes, the promise can first do an RPC to a distributed key/value store, and only then do the expensive computation (which typically is itself an RPC to some other system, perhaps a traditional database). Promise libraries should make this pretty simple to express (though always uglier than blocking calls).

I say expensive because even in a data center, an RPC is going to take on the order of 1ms. The round-trip from a client like a browser is much greater especially where the client (if this is what you meant by "client") has a poor internet connection.

Therefore this pattern is usually done inside of an API server in the data-center. Additional benefits of the server side approach is that an API server can better control the cache keys (you could put the API server "build number" in the cache key so you can do rolling updates of your jobs), TTL, etc. Also, you don't want a rogue computer you don't control to directly talk to a shared cache since a rogue computer could put dangerous values into your cache, remove elements from the cache they shouldn't, etc.

What do you mean by first class citizen, here? I’m pretty sure this works in all languages with promises, but I might be misunderstanding something.

The data structure here is a Map<Key, Promise<Result>>.

Can you please explain a bit more how this avoids fetching the same value twice? Maybe with an example.

It sounds like, if you have a process that takes a long time to do, you check to see if you already have initiated that process (see if the key is in the list of promises), rather than firing off another promise for the same request.

If you get 10 requests to fire function lookUpSomethingThatTakesAWhile(id) that returns a promise, rather than firing off 10 promises, you fire it off once, look up the ID in your map for the next nine, and return that same promise for the next 9.

What is a promise in this context?

A promise is an object representing an asynchronous result. Some languages, such as Python, call them futures. A promise object will typically have methods for determining if result is available, getting the result if it is, and registering callback functions to be called with the result when it's available.
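For instance, Python's concurrent.futures.Future exposes exactly those methods: done(), result(), and add_done_callback(). A minimal sketch:

```python
# Submit a computation to a thread pool and interact with its Future.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    fut = pool.submit(lambda: 6 * 7)  # start the async computation
    # register a callback to be invoked when the result is available
    fut.add_done_callback(lambda f: print("ready:", f.result()))
    print(fut.result())  # blocks until the result is available, prints 42
print(fut.done())        # True: the result has been computed
```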

Thank you!

More generally, I believe this is a monad structure.

Promises are not monads, for one simple reason: they're not referentially transparent. The whole point of OP's example is to double down and take advantage of that.

Referential transparency is a property of expressions, not type classes. Referential transparency is nowhere in the definition of a monad because a monad is obviously not an expression but a type class.

It is definitely true that Promise is a data type that _could_ admit a monad instance. It has:

- a data type M a which is Promise a

- a function pure with the signature a => M a, which is x => Promise.resolve(x)

- a bind function with the signature M a => (a => M b) => M b, which is essentially `then`

But a monad requires three properties to hold true: Right identity, Left identity and associativity. Right identity holds for every function `f: a => M b`, but left identity and associativity do not hold.

Maybe comonads fit better

promise map?

do you mean

    setmetatable({}, {__index = function(tab, key)
          local co = coroutine.running()
          uv.read(fd, function(err, result)
               tab[key] = result
               return coroutine.resume(co, result)
          end)
          return coroutine.yield()
      end })
This is an incomplete implementation, clearly, but I've never missed Promises with coroutines and libuv.

Cache-Oblivious Data Structures: https://cs.au.dk/~gerth/MassiveData02/notes/demaine.pdf

A vaguely related notion is that naive analysis of big-O complexity in typical CS texts glosses over the increasing latency/cost of data access as the data size grows. This can't be ignored, no matter how much we would like to hand-wave it away, because physics gets in the way.

A way to think about it is that a CPU core is like a central point with "rings" of data arranged around it in a more-or-less a flat plane. The L1 cache is a tiny ring, then L2 is a bit further out physically and has a larger area, then L3 is even bigger and further away, etc... all the way out to permanent storage that's potentially across the building somewhere in a disk array.

In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.

Hence, a lot of algorithms that on paper have identical performance don't in reality, because one of the two may have an extra sqrt(n) factor in there.

This is why streaming and array-based data structures and algorithms tend to be faster than random-access, even if the latter is theoretically more efficient. So for example merge join is faster than nested loop join, even though they have the same performance in theory.

Real-Time Collision Detection [1] has a fantastic chapter on this with some really good practical examples, if I remember right.

Great book, I used to refer to it as 3D "data structures" book which is very much in theme with this thread.

[1] https://www.amazon.com/Real-Time-Collision-Detection-Interac...

The implicit grid data structure from there is a personal favorite of mine. I used it in a game once and it performed incredibly well for our use case.

It's a bit too complicated to totally summarize here, but it uses a bit per object in the scene. Then bit-wise operations are used to perform quick set operations on objects.

This data structure got me generally interested in algorithms that use bits for set operations. I found the Roaring Bitmap github page has a lot of useful information and references wrt this topic: https://github.com/RoaringBitmap/RoaringBitmap

I spent a lot of time optimizing an octree for ray tracing. It is my favorite "spatial index" because it's the only one I know that can be dynamically updated and always results in the same structure for given object placement.

It's a stellar book. I work on a commercial realtime physics engine, the "orange book" is practically required reading here.

I often recommend this book, RTCD, as a "better intro to data structures and algorithms" (sometimes as "the best"). The advice often falls on deaf ears, unfortunately.

Seconding the RTCD recommendation. Handy code examples, and my favorite part is that the book is real-world-performance-conscious (hence the "real-time" in the title).

> In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.

I was about to write a comment suggesting that if we made better use of three dimensional space in constructing our computers and data storage devices, we could get this extra latency factor down to the cube root of n.

But then, I decided to imagine an absurdly large computer. For that, one has to take into account a curious fact in black hole physics: the radius of the event horizon is directly proportional to the mass, rather than, as one might expect, the cube root of the mass [1]. In other words, as your storage medium gets really _really_ big, you also have to spread it out thinner and thinner to keep it from collapsing. No fixed density is safe. So, in fact, I think the extra factor for latency in galactic data sets is neither the square root nor cube root, but n itself!

[1] https://en.wikipedia.org/wiki/Schwarzschild_radius

The main difficulty with growing computing devices in three dimensions is cooling. All that heat has to be somehow carried away. It's already difficult with 2D devices. One would have to incorporate channels for cooling liquids into the design, which takes a bite out of the gains from 3D.


This series of blog posts explains this sqrt behaviour quite nicely, covering both theory (including black holes) and practice:


If it's stored far enough away, re-doing the calculation is faster. Depending on how much accuracy you need, you might as well load a 2022-07-22 07:26:34 Earth and have that guy on HN evaluate his thoughts all over again.

How do you accurately include those sqrt(n)’s in your analysis?

I spent a long time looking at an algorithm for long-range force calculations called the Fast Multipole Method, which is O(N). I found that in practice it couldn't compete with an O(N log N) method we used that involved FFTs: because of cache locality etc., the coefficient was so large that you'd need to simulate systems far bigger than is feasible for it to pay off.

Honestly, that is not too uncommon. For top-level algorithm choice, "rounding" O(log(n)) to O(1) and O(n*log(n)) to O(n) is often pretty reasonable for comparison, even though it is strictly incorrect.

It is not uncommon for a simple O(n*log(n)) algorithm to beat out a more complex O(n) algorithm for realistic data sizes. If you "round" them both to O(n), then picking the simpler algorithm becomes the obvious choice, and it can easily turn out to have better perf.

To be fair, in the general case (particles placed in arbitrary positions), it worked very well. The issue was that in the area of electromagnetics I worked on at the time, we make an assumption about things being placed on a regular grid, and that allows you to do the force calculation by convolving a kernel with the charges/dipole strength in Fourier space.

A helpful little heuristic that always stuck with me from a Stanford lecture by Andrew Ng was, "log N is less than 30 for all N".

This implementation and associated paper improve the cache locality to make the algorithm more competitive https://github.com/dmalhotra/pvfmm

They compare solving toy problems with different methods here https://arxiv.org/pdf/1408.6497.pdf

But as you mention, this is just fiddling with constants, and it still may be too costly for your problem

I saw an amazing presentation on purely functional data structures, with a really approachable and understandable proof that a particular structure was asymptotically constant or linear time (I believe), followed by the admission that in practice none of it worked as well as a mutable version because of caching and data locality.

Oh well.

Is there by any chance a publicly available recording of this presentation on purely functional data structures?

NE Scala Symposium Feb 2011

http://vimeo.com/20262239 Extreme Cleverness: Functional Data Structures in Scala

or maybe

http://vimeo.com/20293743 The Guerrilla Guide to Pure Functional Programming

(At work, so I can't check the video)

Both of them were excellent speakers

Does anyone actually use cache-oblivious data structure in practice? Not, like, "yes I know there's a cache I will write a datastructure for that", that's common, but specifically cache-oblivious data structures?

People mention them a lot but I've never heard anyone say they actually used them.

To an extent, the B-Tree data structure (and its variants) are cache oblivious. They smoothly improve in performance with more and more cache memory, and can scale down to just megabytes of cache over terabytes of disk.

The issue is that the last tree level tends to break this model because with some caching models it is "all or nothing" and is the biggest chunk of the data by far.

There are workarounds which make it more truly cache oblivious. How often this is used in production software? Who knows...

There are also Judy arrays: https://en.wikipedia.org/wiki/Judy_array

This sounded promising but then I saw a performance benchmark [1] that was done on a pentium(!) with a very suboptimal hash table design compared to [2] and [3].

The Judy site itself [4] is full of phrases like "may be out of date", when the entire site was last updated in 2004. If it was already out of date in 2004...

It seems that Judy arrays are just kind of a forgotten datastructure, and it doesn't look promising enough for me to put in effort to revive it. Maybe someone else will?

[1] http://www.nothings.org/computer/judy/

[2] https://probablydance.com/2017/02/26/i-wrote-the-fastest-has...

[3] https://probablydance.com/2018/05/28/a-new-fast-hash-table-i...

[4] http://judy.sourceforge.net/

Judy arrays are (under the covers) a kind of radix tree, so they are comparable to things like ART (adaptive radix tree), HAMT (hashed array-mapped trie), or my qp-trie. Judy arrays had some weird issues with patents and lack of source availability in the early days. I think they are somewhat overcomplicated for what they do.

Regarding cache-friendliness, a neat thing I discovered when implementing my qp-trie is that it worked amazingly well with explicit prefetching. The received wisdom is that prefetching is worthless in linked data structures, but the structure of a qp-trie makes it possible to overlap a memory fetch and calculating the next child, with great speed benefits. Newer CPUs are clever enough to work this out for themselves so the explicit prefetch is less necessary than it was. https://dotat.at/prog/qp/blog-2015-10-11.html

I guess I partly got the idea for prefetching from some of the claims about how Judy arrays work, but I read about them way back in the 1990s at least 15 years before I came up with my qp-trie, and I don't know if Judy arrays actually use or benefit from prefetch.

It seems to me that the project is still active: https://sourceforge.net/p/judy/bugs/29/

Cache-obliviousness is an important part of the whole array-of-structs vs struct-of-arrays question. One of the advantages of the struct-of-arrays strategy is that it is cache-oblivious.

Cache oblivious data structures are absolutely used in every serious database.

From what I remember/can find online (don't take this as a reference): B-tree: relational DB, SQLite, Postgres, MySQL, MongoDB. B+-tree: LMDB, filesystems, maybe some relational DBs. LSM trees (Log-structured merge tree, very cool) for high write performance: LevelDB, Bigtable, RocksDB, Cassandra, ScyllaDB, TiKV/TiDB. B-epsilon tree: TokuDB, not sure if it exists anymore. COLA (Cache-oblivious lookahead array): I don't know where it's used.

Maybe modern dict implementations can qualify too, e.g. Python.

Yes, LSM trees are specifically designed with this propery in mind.

Doing a decent job of analysis would include the sqrt(n) memory access factor. Asymptotic analysis ignores constant factors, not factors that depend on data size. It's not so much 'on paper' as 'I decided not to add a significant factor to my paper model'.

Cache oblivious algorithms attempt to make the memory access term insignificant so you could continue to ignore it. They are super neat.

> the increasing latency/cost of data access as the data size grows

Latency Numbers Everyone Should Know https://static.googleusercontent.com/media/sre.google/en//st...

The ratios are mostly correct, but the absolute numbers are wrong these days. For example, memory read latency can be 45ns (not 100) with a decent DDR4 kit and data center latency can be 10us (not 500us) with infiniband.

Do we mean different things by "merge join" and "nested loop join" ? For me "merge join" is O(n) (but requires the data to be sorted by key) whereas "nested loop join" is O(n^2).

Wouldn't you at least be looking at nlog(n) for the sort in the merge join?

Yes, if the data is not already sorted. Thus it's O(n) for already sorted data and O(n log(n) + n) -- which simplifies to O(n log(n)) -- for arbitrary data.
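For reference, a merge join over two key-sorted inputs can be sketched like this. This simple version assumes keys are unique within each input; real implementations also handle runs of duplicate keys.

```python
# One forward pass over each sorted input, so linear in the input sizes.
def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # advance the side with the smaller key
        elif lk > rk:
            j += 1
        else:               # keys match: emit the joined row
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
    return out

print(merge_join([(1, "a"), (2, "b"), (4, "c")],
                 [(2, "x"), (3, "y"), (4, "z")]))
# [(2, 'b', 'x'), (4, 'c', 'z')]
```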

Yeah. I know. Why would the data already be sorted?

In a database you might have already built an index on that data.

I was experimenting a while ago with something that I think is related to this.

If you have a quadratic algorithm where you need to compare each pair of objects in a large array of objects, you might use a loop in a loop like this:

    for (int i = 0; i < size; i++) {  for (int j = 0; j < size; j++) { do_something(arr[i], arr[j]); }  }
When this algorithm runs, it will access every cache line or memory page in the array for each item, because j goes through the whole array for each i.

I thought that a better algorithm might do all possible comparisons inside the first block of memory before accessing any other memory. I wanted this to work independently of the size of cache lines or pages, so the solution would be that for each power of 2, we access all items in a position lower than that power of 2 before moving to the second half of the next power of 2.

This means that we need to first compare indexes (0,0) and then (1,0),(1,1),(0,1) and then all pairs of {0,1,2,3} that haven't yet been compared and so on.

It turns out that the sequence (0,0),(1,0),(1,1),(0,1) that we went through (bottom-left, bottom-right, top-right, top-left in a coordinate system) can be recursively repeated by assuming that the whole sequence that we have done so far is the first item in the next iteration of the same sequence. So the next step is to do (0,0),(1,0),(1,1),(0,1) again (but backwards this time) starting on the (2,0) block so it becomes (2,1),(3,1),(3,0),(2,0) and then we do it again starting on the (2,2) block and then on the (0,2) block. Now we have done a new complete block that we can repeat again starting on the (4,0) block and then (4,4) and then (0,4). Then we continue like this through the whole array.

As you can see, every time we move, only one bit in one index number is changed. What we have here is actually two halves of a gray code that is counting up, half of it in the x coordinate and the other half in y. This means that we can iterate through the whole array in the described way by making only one loop that goes up to the square of the size and then we get the two indexes (x and y) by calculating the gray code of i and then de-interleaving the result so that bits 0,2,4... make the x coordinate and bits 1,3,5... make y. Then we compare indexes x and y:

    for (int i = 0; i < size * size; i++) { get_gray_xy(i, &x, &y); do_something(arr[x], arr[y]); }
The result of all this was that the number of cache line/page switches went down by orders of magnitude but the algorithm didn't get faster. This is probably for two reasons: incremental linear memory access is very fast on modern hardware and calculating the gray code and all that added an overhead.

While reading your post I kept thinking "de-interleave the bits of a single counter" since I have used that before to great benefit. It does suffer from issues if your sizes are not powers of two, so your grey code modification seems interesting to me. I'll be looking in to that.

FYI this also extends to higher dimensions. I once used it to split a single loop variable into 3 for a "normal" matrix multiplication to get a huge speedup. Unrolling the inner 3 bits is possible too but a bit painful.

Also used it in ray tracing to scan pixels in Z-order, which gave about 10 percent performance improvement at the time.

But yes, I must look at this gray code step to see if it makes sense to me and understand why it's not helping you even with reduced cache misses.

Interestingly I get more cache misses with the gray code solution but still fewer switches between cache lines. This has to be because the cache line prefetcher can predict the memory usage in the simple version but it's harder to predict in the gray code or de-interleave version.

Also, I have never thought about de-interleaving bits without using gray code but that seems to work similarly. The main reason I used gray code is that then we don't have to switch memory blocks as often since only one of the indexes changes every time so one of the current memory blocks can always stay the same as in the previous iteration.

This is similar to what I'm working on.

I am working on machine sympathetic systems. One of my ideas is concurrent loops.

Nested loops are equivalent to the Cartesian product of the loops: loop × loop × loop.

  for letter in letters:
   for number in numbers:
    for symbol in symbols:
     print(letter + number + symbol)
If len(letters) == 3, len(numbers) == 3, len(symbols) == 3, you can think of this as loop indexes "i", "j" and "k" that begin at 000 and count up in a kind of base 3: 001, 002, 010, 011, 012, 020, 021, 022, 100 and so on.

The formula for "i", "j" and "k" is

  N = self.N
  indexes = []
  for item in self.sets:
    N, r = divmod(N, len(item))
    indexes.append(r)
So the self.N is the product number of the nested loops, you can increment this by 1 each time to get each product of the nested loops.

We can load balance the loops and schedule the N to not starve the outer loops. We can independently prioritize each inner loop. If you represent your GUI or backend as extremely nested loops, you can write easy to understand code with one abstraction, by acting as if loops were independent processes.

So rather than that sequence, we can generate 000, 111, 222, 010, 230, 013 and so on.

Load balanced loops means you can progress on multiple items simultaneously, concurrently. I combined concurrent loops with parallel loops with multithreading. I plan it to be a pattern to create incredibly reactive frontends and backends with low latency to processing. Many frontends lag when encrypting, compressing or resizing images, it doesn't need to be that slow. IntelliJ doesn't need to be slow with concurrent loops and multithreading.
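A sketch of the divmod decoding described above (function and variable names are made up for illustration): a single counter n enumerates every product of the nested loops, mixed-radix style, so differently sized sets are fine and iterations can be visited in any schedule of n.

```python
# Decode a flat counter n into one index per nested loop (mixed radix).
def indexes_at(sets, n):
    out = []
    for s in sets:                 # innermost loop varies fastest
        n, r = divmod(n, len(s))
        out.append(r)
    return out

sets = [list("!?."), list("123"), list("abc")]
# Visiting n = 0..26 in any order hits every index combination exactly once.
seen = {tuple(indexes_at(sets, n)) for n in range(3 * 3 * 3)}
print(len(seen))  # 27
```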

See https://github.com/samsquire/multiversion-concurrency-contro...

Loops are rarely first class citizens in most languages I have used and I plan to change that.

I think I can combine your inspiration of gray codes with this idea.

I think memory layout and data structure and algorithm can be 3 separate decisions. I am yet to see any developers talk of this. Most of the time the CPU is idling waiting for memory, disk or network. We can arrange and iterate data to be efficient.

I decided to try implement get_gray_xy. I am wondering how you can generalise it to any number of loops and apply it to nested loops of varying sizes, that is if you have three sets and the sizes are 8, 12, 81.

Wikipedia says graycodes can be found by num ^ (num >> 1)

  a = [1, 2, 3, 4]
  b = [2, 4, 8, 16]
  indexes = set()
  correct = set()

  print("graycode loop indexes")
  for index in range(0, len(a) * len(b)):
    code = index ^ (index >> 1)
    left = code & 0x0003
    right = code >> 0x0002 & 0x0003
    print(left, right)
    indexes.add((left, right))
  assert len(indexes) == 16
  print("regular nested loop indexes")
  for x in range(0, len(a)):
    for y in range(0, len(b)):
      print(x, y)
      correct.add((x, y))
  assert correct == indexes
This produces the sequence:

    graycode loop indexes
    0 0
    1 0
    3 0
    2 0
    2 1
    3 1
    1 1
    0 1
    0 3
    1 3
    3 3
    2 3
    2 2
    3 2
    1 2
    0 2
    regular nested loop indexes
    0 0
    0 1
    0 2
    0 3
    1 0
    1 1
    1 2
    1 3
    2 0
    2 1
    2 2
    2 3
    3 0
    3 1
    3 2
    3 3

Here is the implementation of get_gray_xy that I used:

    static uint64_t deinterleave(uint64_t x) {
        x = x & 0x5555555555555555;
        x = (x | (x >>  1)) & 0x3333333333333333;
        x = (x | (x >>  2)) & 0x0f0f0f0f0f0f0f0f;
        x = (x | (x >>  4)) & 0x00ff00ff00ff00ff;
        x = (x | (x >>  8)) & 0x0000ffff0000ffff;
        x = (x | (x >> 16)) & 0x00000000ffffffff;
        return x;
    }

    static void get_gray_xy(uint64_t n, uint64_t *x, uint64_t *y) {
        uint64_t gray = n ^ (n >> 1);
        *x = deinterleave(gray);
        *y = deinterleave(gray >> 1);
    }
I don't think the gray code solution can work well for sets of different sizes because the area of indexes that you go through grows as a square.

But it does work for different number of dimensions or different number of nested loops. For 3 dimensions each coordinate is constructed from every 3rd bit instead of 2nd.

So if you have index 14, then gray code is 9 (1001) which means in two dimensions x=1,y=2 and in three dimensions, x=3,y=0,z=0.

Thank you for sharing your idea and your code and thoughts and thanks for your reply.

I would like to generalise it for sets of different sizes. Maybe it's something different to a gray codes which you taught me today, it feels that it should be simple.

Maybe someone kind can step in and help me or I can ask it on Stackoverflow.

I need to think it through.

At one point I was reading a section on Intel's website about cache optimisation, with a GUI tool that shows you the cache behaviour of the CPU for column- vs row-major iteration. I found it very interesting but I cannot seem to find it at this time. I did find a Medium article about Cachediff.

deinterleave doesn't actually produce gray code does it? The GP said to convert to gray code before doing the deinterleave.

> but the algorithm didn't get faster

Could it be branch predictor at play? https://stackoverflow.com/questions/11227809/why-is-processi...

The union-find data structure / algorithm is useful and a lot of fun.

The goal is a data structure where you can perform operations like "a and b are in the same set", "b and c are in the same set" and then get answers to questions like "are a and c in the same set?" (yes, in this example.)

The implementation starts out pretty obvious - a tree where every element either points at itself or something it was merged with. To check if two elements are in the same set, check if they have the same root. Without analyzing it, it sounds like you'll average findRoot() performance of O(log(n)), worst-case O(n).

There are a couple of simple optimizations you can do to this structure, the type of things that seem like they shouldn't end up affecting asymptotic runtime all that much. The first is that, whenever you find a root, you can re-parent all the nodes you visited on the way to that root, so they'll all be quicker to look up next time. The other is that you keep track of the size of sets, and always make the larger set be the parent of the smaller.

And neither of those actually does anything impressive alone, but if you use both, the algorithm suddenly becomes incredibly fast, with the slowest-growing (non-constant) complexity I've ever heard of: O(α(n)), where α is the inverse Ackermann function. For any reasonable n, that's 4 or less.
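A minimal sketch of the structure with both optimizations, path compression in find() and union by size (class and method names are made up for illustration):

```python
class UnionFind:
    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        if x not in self.parent:          # lazily create singleton sets
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:  # walk up to the root
            root = self.parent[root]
        while self.parent[x] != root:     # path compression: re-parent the path
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:  # union by size: larger set is parent
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def same(self, a, b):
        return self.find(a) == self.find(b)

uf = UnionFind()
uf.union("a", "b")
uf.union("b", "c")
print(uf.same("a", "c"))  # True
print(uf.same("a", "d"))  # False
```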

Agreed, union-find is great.

My favourite usage of it in practice is for solving first-order (syntactic) unification problems. Destructive unification by means of rewriting unbound variables to be links to other elements is a very pretty solution - especially when you consider the earlier solutions such as that of Robinson's unification which, when applied, often involves somewhat error-prone compositions of substitutions (and is much slower).

The fact that the forest of disjoint sets can be encoded in the program values you're manipulating is great (as opposed to, say, the array-based encoding often taught first): makes it very natural to union two elements, chase up the tree to the set representative, and only requires minor adjustment(s) to the representation of program values.

Destructive unification by means of union-find for solving equality constraints (by rewriting unbound variables into links and, thus, implicitly rewriting terms - without explicit substitution) makes for very easy little unification algorithms that require minimal extension to the datatypes you intend to use and removing evidence of the union-find artefacts can be achieved in the most natural way: replace each link with its representative by using find(l) (a process known as "zonking" in the context of type inference algorithms).
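A minimal sketch of this destructive, union-find-style unification: unbound variables become links into other terms, so no substitutions are composed. The term representation is made up for illustration, and the occurs check is omitted.

```python
class Var:
    def __init__(self, name):
        self.name = name
        self.link = None            # None = unbound; else a link to a term

def find(t):
    while isinstance(t, Var) and t.link is not None:
        t = t.link                  # chase links to the representative
    return t

def unify(a, b):
    a, b = find(a), find(b)
    if a is b:
        return True
    if isinstance(a, Var):          # bind the variable in place
        a.link = b
        return True
    if isinstance(b, Var):
        b.link = a
        return True
    # compound terms are tuples: ("f", arg1, ...); constants are ("c",)
    if a[0] != b[0] or len(a) != len(b):
        return False
    return all(unify(s, t) for s, t in zip(a[1:], b[1:]))

x = Var("x")
ok = unify(("f", x, ("c",)), ("f", ("c",), x))   # f(x, c) = f(c, x)
print(ok, find(x))  # True ('c',)
```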

Basic demonstration of what I'm talking about (the incremental refinement of the inferred types is just basic union-find usage): https://streamable.com/1jyzg8

I remember Ravelin shared their story of using it in card fraud detection code.

Basically they managed to replace a full-fledged graph database with a small piece of code using union-find in Go. Was a great talk.


My writeup of union find is in https://dercuano.github.io/notes/incremental-union-find.html. I did not in fact find a way to make it efficiently support incremental edge deletion, which is what I was looking for.

> I did not in fact find a way to make it efficiently support incremental edge deletion, which is what I was looking for.

I don't understand this goal. The interior connections aren't relevant to a union-find structure; ideally you have a bunch of trees of depth 2 (assuming the root is at depth 1), but the internal structure could be anything. That fact immediately means that the consequence of removing an edge is not well-defined - it would separate the two objects joined by the edge, and it would also randomly divide the original connected set into two connected sets, each containing one of those two objects. Which object each "third party" object ended up being associated with would be an artifact of the interior structure of the union-find, which is not known and is constantly subject to change as you use the union-find.

If all you want is to be able to remove a single object from a connected set, on the assumption that your union-find always has the ideal flat structure, that's very easy to do - call find on the object you want to remove, which will link it directly to the root of its structure, and then erase that link.

Union find takes as input some undirected graph <N, E> and internally constructs (and progressively mutates) a directed graph <N, E'> which it uses to efficiently answer queries about whether two nodes n₁, n₂ ∈ N are in the same connected component of <N, E>. It additionally supports incrementally adding edges to E.

My quest was to find a way to incrementally delete edges from E, not E'. You're talking about deleting edges from E', which I agree is not generally a useful thing to do.

For example, your N might be the cells of a maze map, and your E might be the connections between adjacent cells that are not separated by a wall. In that case you can tear down a wall and add a corresponding edge to E. But it would be nice in some cases to rebuild a wall, which might separate two previously connected parts of the maze. I was looking for an efficient way to handle that, but I didn't find one.

I don't think there's a way to extend union-find to do what you want in the general case. You might have more luck starting with a minimum cut algorithm.

For the specific case where you can identify some of your edges are more or less likely to be deleted, though, you can run union-find on only the stable edges, cache that result, and then do the unstable edges. Whenever an unstable edge is deleted, reload the cache and redo all the unstable edges. This works for something like a maze with open/closable doors.
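A sketch of that caching idea (function names are mine, not a standard API): run union-find over the stable edges once, keep that result, and replay only the unstable edges on top whenever one of them is deleted.

```python
import copy

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

    def connected(self, x, y):
        return self.find(x) == self.find(y)

def build_base(n, stable_edges):
    # the expensive part, done once and cached
    uf = UnionFind(n)
    for u, v in stable_edges:
        uf.union(u, v)
    return uf

def with_unstable(base, unstable_edges):
    # after any unstable-edge deletion: copy the cached stable
    # result and replay the surviving unstable edges
    uf = copy.deepcopy(base)
    for u, v in unstable_edges:
        uf.union(u, v)
    return uf
```

Deleting a stable edge still forces a full rebuild; the win is that "doors" can open and close cheaply relative to the size of the stable structure.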

Those are some interesting ideas, thanks! I hadn't thought about trying to apply minimum-cut to the problem!

> Union find takes as input some undirected graph <N, E> and internally constructs (and progressively mutates) a directed graph <N, E'> which it uses to efficiently answer queries about whether two nodes n₁, n₂ ∈ N are in the same connected component of <N, E>.

I don't think this is a valuable way to think about the structure. That's not what it's for.

A union-find is a direct representation of the mathematical concept of an equivalence relation. The input is a bunch of statements that two things are equal. It will then let you query whether any two things are or aren't equal to each other.

This is captured well by the more formalized name "disjoint-set structure". You have a set of objects. You can easily remove an element from the set. But it makes no conceptual sense to try to "separate two of the elements in a set". They're not connected, other than conceptually through the fact that they are both members of the same set.

A union-find is a good tool for answering whether two vertices are in the same connected component of a graph because being in the same connected component of a graph is an equivalence relation, not because the union-find is a graph-related concept. It isn't, and there is no coherent way to apply graph-theoretical concepts to it.

Putting things another way:

You describe a union-find as something used to "efficiently answer queries about whether two nodes n₁, n₂ ∈ N are in the same connected component of [a graph]".

But you focused on the wrong part of that statement. What makes a union-find appropriate for that problem is the words "the same", not the words "connected component".

You are welcome to think about the union-find structure however you prefer to think about it, but I was describing the problem I was trying to solve, for which the correct description of union find I gave above is optimal. Contrary to your mistaken assertion, no contradictions arise from applying graph-theoretical concepts in this way; there is no problem of "coherency". It's just a form of description you aren't accustomed to.

If your way of thinking about union find makes it hard for you to understand the problem I was trying to solve, maybe it isn't the best way to think about it for the purpose of this conversation, even if it is the best way to think about it in some other context.

I'm not claiming my description is the only correct description of union find, just that it's the one that most closely maps to the problem I was trying to solve.

The description can equally well be formulated in the relational terms you prefer: given a relation R, union find can efficiently tell you whether, for any given x and y, (x, y) ∈ R', where R' is the symmetric transitive reflexive closure of R (and is thus the smallest equivalence relation containing R). It efficiently supports incrementally extending R by adding new pairs (a, b) to it.

This is equivalent to my graph-theoretic explanation above except that it omits any mention of E'. However, it is longer, and the relationship to the maze-construction application is less clear. Perhaps you nevertheless find the description clearer.

What I was looking for, in these terms, is a variant of union find that also efficiently supports incrementally removing pairs from R.

Does that help you understand the problem better?

— ⁂ —

In general there is a very close correspondence between binary relations and digraphs, so it's usually easy to reformulate a statement about relations as an equivalent statement about digraphs, and vice versa. But one formulation or the other may be more perspicacious.

More generally still, as you move beyond the most elementary mathematics, you will learn that it's usually a mistake to treat one axiom system or vocabulary as more fundamental than another equivalent axiom system, since you can derive either of them from the other one. Instead of arguing about which one is the right axiom system or vocabulary, it's more productive to learn to see things from both viewpoints, flexibly, because things that are obvious from one viewpoint are sometimes subtle from the other one.

> You are welcome to think about the union-find structure however you prefer to think about it, but I was describing the problem I was trying to solve, for which the correct description of union find I gave above is optimal.

> If your way of thinking about union find makes it hard for you to understand the problem I was trying to solve, maybe it isn't the best way to think about it for the purpose of this conversation, even if it is the best way to think about it in some other context.

I'm not having any problems understanding the problem you were trying to solve. My problems are in the area of understanding why you thought a union-find might be an effective way to address that problem.

I'm telling you that, if you adjust your view of what a union-find is, you will have a better conception of where it can and can't be usefully applied.

> In general there is a very close correspondence between binary relations and digraphs, so it's usually easy to reformulate a statement about relations as an equivalent statement about digraphs, and vice versa. But one formulation or the other may be more perspicacious.

In this case, the relation is symmetric by definition, so you just have graphs. Yes, it's obvious how you can use a graph to describe a relation.

But the entire point of the union-find structure is to discard any information contained in an equivalent graph that isn't contained in the relation. (And there are many equivalent graphs, but only one relation!) The absence of that information is what makes the structure valuable. It shouldn't be a surprise that, if you want to preserve information from a particular graph, you are better served with graph-based algorithms.

Well, that's an interesting way to look at it; I see better where you're coming from now.

But I wasn't trying to get the unmodified data structure to handle deletion efficiently; I was trying to use it as a basis to design a structure that could. I thought I saw a minor modification that achieved this, but then I found out why it doesn't work. That's what my linked note above is about.

More generally, if you only try things that will probably work, you'll never achieve anything interesting.

Can you give some examples where union-find is applied to great benefit?

A while back I had to perform an entity resolution task on several million entities.

We arrived at a point in the algorithm where we had a long list of connected components by ids that needed to be reduced into a mutually independent set of components, e.g. ({112, 531, 25}, {25, 238, 39, 901}, {43, 111}, ...)

After much head-banging, working out a way to do this that wouldn't lead to an out-of-memory error, we found the disjoint-set forest data structure and our prayers were answered.

I was very grateful for it!
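That reduction can be sketched with a dict-based union-find (helper names are mine): union the ids within each group, then bucket every id by its root.

```python
def merge_components(groups):
    """Reduce overlapping id groups to disjoint components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for group in groups:
        group = list(group)
        if not group:
            continue
        first = group[0]
        for other in group[1:]:
            parent[find(other)] = find(first)

    # bucket each id under its root representative
    components = {}
    for x in list(parent):
        components.setdefault(find(x), set()).add(x)
    return list(components.values())
```

Memory stays proportional to the number of distinct ids, which is what makes this workable at the multi-million-entity scale described above.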

Exactly the same for me. We had a defect management system, and a defect can be connected to others as dupes. We'd have whole chains of dupes. We used union find to identify them.

One example is connected component analysis on images. In one linear scan of an image (2D or 3D) you can find all connected blobs.

I used it to find and track islands of connected polygons in 3D meshes.

A textbook example of an algorithm that works very well with union-find is Kruskal's algorithm for finding a minimum-weight spanning tree of a graph. Using union-find improves the time complexity of the algorithm from O(V²) to O(E log V).

This happens because Kruskal's algorithm essentially selects the cheapest edge not already included in our spanning tree that won't cause a cycle. Union-find speeds up this potential-cycle check, which would otherwise be naively quadratic.
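The two pieces fit together in a few lines (a compact sketch; edges given as weight-first tuples so plain sorting orders them by cost):

```python
def kruskal(n, edges):
    """edges: list of (weight, u, v). Returns (total_weight, mst_edges)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total, mst = 0, []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:             # same root would mean this edge closes a cycle
            parent[ru] = rv
            total += w
            mst.append((u, v))
    return total, mst
```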

Oh, huh, this is serendipitously handy because I need to be doing something like that this weekend (translating from Go to Rust - I already have a reparenting solution using hashes but hopefully this is easier / better.)

I have always wanted to really understand this data structure. Sure, I can follow the analysis with the potential functions and all, but I never really understood how Tarjan came up with the functions in the first place. Does anybody have a resource which intuitively explains the analysis?

This might be a long read, but it is a well-written, reader-friendly analysis of Tarjan's proof with its chain of reasoning. https://codeforces.com/blog/entry/98275

Robert Sedgewick has a great explanation in his Algorithms book. He has gorgeous diagrams alongside his explanations.

Is it in an old edition? I just downloaded an extremely legitimate version of Algorithms, 4th edition, but it has no section on the union-find data structure.

It should be in Chapter 1 Section 5 of the 4th edition. It's even in the table of contents.

Sorry, my bad. I searched for the word "disjoint" expecting it to be prominent in this section, but somehow this word only appears 3 times in this book, none of which are in this section!

Anyway, while the figures in this work were indeed gorgeous, it was not what I was looking for. It did not even contain a non-intuitive analysis of the inverse Ackermann complexity, much less an intuitive one!

Here's a writeup I made a while ago: https://www.overleaf.com/read/hjbshsqmjwpg

We use it to wire different outputs and inputs to gates in a zero-knowledge proof circuit!

I have heard stories about annotating the edges with elements from a group rather than using unlabelled edges. Have you heard anything about that?

You can efficiently maintain count/max/sum/hash/etc. for each component. It's occasionally useful.
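A sketch of how such per-component aggregates can ride along with union-by-size (illustrative, not a standard library API): the aggregate is only kept accurate at each root, and merging updates it in O(1).

```python
parent, size, total = {}, {}, {}

def add(x, value):
    # register a new singleton element carrying `value`
    parent[x], size[x], total[x] = x, 1, value

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    rx, ry = find(x), find(y)
    if rx == ry:
        return
    if size[rx] > size[ry]:
        rx, ry = ry, rx          # attach the smaller tree under the larger
    parent[rx] = ry
    size[ry] += size[rx]         # aggregates only need updating at the root
    total[ry] += total[rx]
```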

I wrote a beginner's guide on this topic a while ago https://medium.com/nybles/efficient-operations-on-disjoint-s...

This sounds super useful thanks! I think I'll have something this can be used for, with individual claims of whether two elements are in the same set, then easily getting the largest set of equal items.

If at some point you have a cycle, you can then reparent and remove the cycle, right? This structure in general then can encode transitive relations effectively?

Something like “a and b …” “b and c …” “c and a …”.

> If at some point you have a cycle, you can then reparent and remove the cycle, right?

You'd never have a cycle. The way a cycle would theoretically arise would have to be joining something to its own child. But you don't naively join the two objects you're given - you find their root objects, and join those. If - as would be necessary for a cycle to form - they have the same root, you can return without doing anything.

If we call

    unite(a, b)
    unite(b, c)
    unite(c, a)

then we should end up with this structure:

      c
     / \
    b   a
With parents on the right, the structure will be a-b after the first call and a-b-c after the second call. The reason we end up with a shallower tree after the third call is that when a is passed to unite, we call find on it and it gets reparented directly to c. Then, c (the representative for c) and c (the representative for a) are already related, so we don't adjust the structure further.

> This structure in general then can encode transitive relations effectively?

Well... no. This structure embodies the concept of an equivalence relation. You wouldn't be able to use it to express a general transitive relation, only a transitive relation that is also symmetric and reflexive.


Is there a name for the data structure/algorithm?

It's called union-find, but also disjoint-set sometimes. The process of reparenting nodes is called path compression, and the optimization of using the larger of the two sets as the parent when merging is usually called "union by size" (or "union by rank", when a bound on tree depth is used instead of size). Any of those terms should give you more details upon search.

I remember getting this in an interview without ever coming across it.. basically, fuck union find.

Wouldn't that be extremely useful and a polynomial solution for 3-SAT, which is NP complete?

I think I understand why you're confused here. To the extent that union-find is similar to SAT, it's similar to 2-SAT: the constraints you're reasoning about are defined only on pairs of elements, whereas 3-SAT is fundamentally reasoning about triplets (at least one of these three variables is true). Notably, 2-SAT is a variant of SAT that is polynomial in time, unlike 3-SAT.

SAT's complexity comes from blowing up a reasonable number of variables and clauses (very rarely equivalent to one another) into an enormous number of variable assignments: too many to store, and falling into only three classes (formula true, formula false, or not yet evaluated), which are not useful equivalences. What disjoint sets are you thinking of tracking?

How is it a solution to 3-SAT?

(Fantastic post idea OP. One of the best I've ever seen :D)

Related to bloom filters, xor filters are faster and more memory efficient, but immutable.

HyperLogLog is an efficient way to estimate cardinality.

Coolest thing I've learned recently was the Y-fast trie. If your dataset M is bounded integers (say, the set of all 128-bit numbers), you get membership, predecessor, or successor queries in O(log log M) time, rather than the O(log n) of a normal tree.

see: https://www.youtube.com/playlist?list=PLUl4u3cNGP61hsJNdULdu... (6.851 Advanced Data Structures, Erik Demaine)

Would love to learn more "stupidly fast at the cost of conceptual complexity" things.


(adding) I don't know a name for it, because it's not one thing but a combination, but one can:

Use the first N bits from a very quick hash of a key from an unknown distribution (say a file path, or a variable name, or an address, or a binary blob,..) as a way to "shard" this key across M other fast data structures (like a tree) for search/add/remove. By changing M, you can tune the size of the terms in the O(1)+O(log) equation for running time.

Trees getting too deep for fast search? Every increase of N moves the computation from the log search of the tree to the space tradeoff of having more trees.

Added benefit is you can scale to multiple threads easily. Instead of locking the whole tree, you lock a tiny sub-tree.

Very clever. (I first saw in the Aerospike key value store)
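A toy version of the idea (sorted Python lists stand in for the per-shard trees; the hash-based sharding and per-shard locking are the point, and all names are mine):

```python
import threading
from bisect import bisect_left

class ShardedSet:
    """Shard keys across 2**bits sub-structures by the low bits of a hash.

    Each shard has its own lock, so concurrent operations on different
    shards never contend, and each shard's "tree" stays small.
    """
    def __init__(self, bits=4):
        self.bits = bits
        n = 1 << bits
        self.shards = [[] for _ in range(n)]
        self.locks = [threading.Lock() for _ in range(n)]

    def _shard(self, key):
        # low `bits` bits of a (fast) hash pick the shard
        return hash(key) & ((1 << self.bits) - 1)

    def add(self, key):
        i = self._shard(key)
        with self.locks[i]:
            shard = self.shards[i]
            j = bisect_left(shard, key)
            if j == len(shard) or shard[j] != key:
                shard.insert(j, key)

    def __contains__(self, key):
        i = self._shard(key)
        with self.locks[i]:
            shard = self.shards[i]
            j = bisect_left(shard, key)
            return j < len(shard) and shard[j] == key
```

Raising `bits` trades memory (more shards) for shallower per-shard searches, which is the N-versus-log tuning knob described above.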

If you enjoyed XOR filters, you might also like ribbon filters, something that I had the pleasure of working on last year. They share the basic idea of using a system of linear equations, but instead of considering 3 random positions per key, the positions to probe are narrowly concentrated along a ribbon with a typical width of 64. This makes them far more cache-efficient to construct and query.

By purposefully overloading the data structure by a few per cent and bumping those items that cannot be inserted as a result of this overloading to the next layer (making this a recursive data structure), we can achieve almost arbitrarily small space overheads: <1% is no problem for the fast configurations, and <0.1% overhead with around 50% extra runtime cost. This compares to around 10% for XOR filters and ≥ 44% for Bloom filters.

In fact, I'm going to present them at a conference on Monday - the paper is already out: https://drops.dagstuhl.de/opus/volltexte/2022/16538/pdf/LIPI... and the implementation is at https://github.com/lorenzhs/BuRR/. I hope this isn't too much self-promotion for HN, but I'm super hyped about this :)

This made my day, thank you so much. :) Your explanation makes sense and it sounds brilliant/clever yet obvious. (like many good ideas are)

I'm reading the paper and looking at your github now, and look forward to "github/academia" stalking you in the future. Skimming your list of repositories and seeing a lot of stuff I understand and could possibly use. ;-)

(I find it to be a useful strategy to, when I find a clever concept in a paper, or in code on github, then look at all the other things done by the same person, or the works of people they co-author with. "collaborative filtering" for ideas.)

Y-fast tries are some of my favorites. I think they are heavily underutilized today. They sat by the wayside for a long time because datasets were relatively small, large cheap RAM didn't exist when they were created, and bitwise operations were inefficient, along with many other constant factors.

Today, however, a lot of people have datasets on the order of 2^16 or 2^32 keys they need to maintain. And efficient bitwise operations (on up to 512 bits) are the norm. Y-fast tries are faster than B-trees and other such data structures here. Also, because they divide _space_ rather than the _data_, they are very amenable to multi-threaded and distributed algorithms. For example, the hash tables in a y-fast trie can actually be a rendezvous hash pointing to a database node. Once in that node you can hash on cores again to get to a local process, for example.

I want to hear more, esp about the distributed applications, do you have any useful links, or can I buy you an "e coffee" to pick your brain for a few minutes?

> Also because they divide _space_ not the _data_ they are very amenable to multi-threaded and distributed algorithms.

Ha, I was thinking about this idea the other day and couldn’t figure out a good way to search for anything already written on the topic. I suspect there’s quite a lot of ground to be gained in the area.

Big fan of HLL

The Apache Foundation has a fantastic DataSketches library that includes HLL and many other powerful data analytics algorithms: https://datasketches.apache.org/

Lee Rhodes has done an excellent introduction to this library - explaining some of the use cases, advantages, and things to be aware of when using these techniques: https://www.youtube.com/watch?v=nO9pauS-mGQ

On sketches, there is a genre of structure for estimating histogram-like statistics (median, 99th centile, etc) in fixed space, which i really like. Two examples:

t-digest https://github.com/tdunning/t-digest

DDSketch https://github.com/DataDog/sketches-java

In the same spirit there is also Circllhist from Circonus. https://arxiv.org/abs/2001.06561

There was a good discussion on approaches to store histogram here: https://news.ycombinator.com/item?id=31379383

t-digest is used by AWS X-ray to render their charts

A fantastic thing about HyperLogLog is that it can be merged, so you can split your data between multiple servers, precompute HLLs for all IPs every minute, and then ask "how many unique IPs were there yesterday?"
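A minimal HLL sketch showing why merging works (illustrative only: SHA-1 as the hash, 1024 registers, and the standard small-range correction; a production implementation has more corrections than this):

```python
import hashlib
from math import log

M = 1 << 10  # 1024 registers

def _hash(item):
    return int(hashlib.sha1(repr(item).encode()).hexdigest(), 16)

def hll_new():
    return [0] * M

def hll_add(regs, item):
    h = _hash(item)
    idx = h % M                  # low bits pick the register
    w = h // M                   # remaining bits feed the rank
    rank = (w & -w).bit_length() # position of the lowest set bit
    regs[idx] = max(regs[idx], rank)

def hll_merge(a, b):
    # The key property: each register is a max over the items that
    # hashed to it, so merging sketches is just an element-wise max.
    return [max(x, y) for x, y in zip(a, b)]

def hll_count(regs):
    alpha = 0.7213 / (1 + 1.079 / M)
    est = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if est <= 2.5 * M and zeros:          # small-range correction
        est = M * log(M / zeros)
    return est
```

Because merge is an element-wise max, merging per-server (or per-minute) sketches gives exactly the sketch you would have built from the union of the raw streams.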

Discovered HLL because it's used in ClickHouse, which employs a ton of cool but obscure data structures.

Works well in analytics cubes since they can be combined.

You can retain them across time too, such that you can ask questions like "how many unique users were there over the last N days?" without needing the source data. Great for privacy-aware analytics solutions.

Love DataSketches, but I was wondering if there is a way to compute sketches across time, e.g. I want to compute the users who did X and then Y, in that order. Since intersection is commutative, it doesn't give an answer for time ordering.

Nonetheless, the best data structure I have read about in the last 10 years.

I like https://en.wikipedia.org/wiki/Cuckoo_filter which allows for delete, unlike Bloom

I forgot about Aerospike. They basically built a NAND-optimized key-value store, right? I remember reading about how they used the FTL and thinking they were pretty clever. I can't for the life of me find the article now. I think they were really big in the ad-tech space? Is that still the case?

"NAND optimized key value store" doesn't do it justice ;-) The fact that it's SSD optimized has nothing to do with key sharding across trees, the latter is what gets you the absurdly low latency and near infinite scale out. This link gives an overview: https://vldb.org/pvldb/vol9/p1389-srinivasan.pdf And it's open source...

I think what you’re describing in the last example is the MinHash algorithm

edit: actually I’m not sure, sounds related though

Sounds a bit like radix sort?

AIUI Radix sort is "kinda-ish" related, yeah :) They both involve looking at your data at a 90 degree angle and slicing on bitplanes.

HAMT: Hash Array Mapped Trie. This data structure makes efficient immutable data possible. You can update a list of a million items, and keep a reference to the original list, by changing 3 or 4 references and some bytes.

This should replace copy-on-write for scripting languages. I really want to see it in a JS spec soon. There are libraries that can do it, but they add translation penalties and extra steps. I’d complain less if I could use Clojurescript professionally, because it and Clojure are built around them.


What I like about HAMTs is that they can be super simple to implement if you make them one bit per level. They are like a combination of a binary search tree and hash table but without any of their annoyances.

* In binary search trees, you need to balance the tree every time you insert something because the tree will be linear if you insert the nodes in order. In a HAMT the positions are determined by the hash, so the positions will be random-like and not depend on the order you insert the nodes in.

* In binary search trees, you need to perform some complicated steps to remove a node. In a HAMT, you can either just mark the node as deleted, or if you want to fill the gap, find a leaf node that has the removed node on its path and just put it there.

* In hash tables, when it starts to get full you need to rehash it or put the added item in the "wrong" slot or something. A HAMT never gets full because it's a tree so you just add another child node.

Here I have written an explanation of a simple HAMT: https://www.robalni.org/posts/20220507-a-hash-table-you-dont...

Here is an implementation of a binary HAMT that allows you to add and find nodes:
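The following is an illustrative Python sketch in that spirit (not the commenter's original code): one hash bit per level, persistent inserts via path copying, and it assumes no two keys share an identical full hash.

```python
class Leaf:
    def __init__(self, key, value):
        self.key, self.value, self.hash = key, value, hash(key)

class Node:
    def __init__(self, children=(None, None)):
        self.children = list(children)  # indexed by one hash bit per level

def insert(node, key, value, depth=0):
    """Persistent insert: returns a new root; untouched subtrees are shared."""
    if node is None:
        return Leaf(key, value)
    if isinstance(node, Leaf):
        if node.key == key:
            return Leaf(key, value)     # replace the existing binding
        branch = Node()
        branch.children[(node.hash >> depth) & 1] = node  # push old leaf down
        return insert(branch, key, value, depth)
    new = Node(node.children)           # path copy of this internal node
    bit = (hash(key) >> depth) & 1
    new.children[bit] = insert(node.children[bit], key, value, depth + 1)
    return new

def find(node, key, depth=0):
    while isinstance(node, Node):
        node = node.children[(hash(key) >> depth) & 1]
        depth += 1
    if isinstance(node, Leaf) and node.key == key:
        return node.value
    return None
```

A real HAMT uses 5-6 bits per level with a bitmap-compressed child array, but the binary version shows the shape of the idea.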


Another interesting approach to copy-on-write for immutable collections (for example arrays) is where you actively mutate the array in place, but leave the original as a lazy description/view of how to get back to the before state.

From the outside the effect is the same, but the performance is optimized for accessing the updated collection and only possibly using the old value.

Great for cases where you want the immutability guarantee, but where it might be unlikely the old value is actually going to be used in the general case.

See also this cppcon talk by Juan Pedro https://youtu.be/sPhpelUfu8Q

Databases do that.

If the Record & Tuple proposal advances to stage 3 we'll finally have native immutable data structures in JS [1].

[1] https://github.com/tc39/proposal-record-tuple

I have been waiting aeons for this, and suspect I will need to continue waiting effectively forever :(

Yeah TC39 has a habit of letting proposals languish at stage 2 for years. I wouldn't give up hope though. The decorators proposal finally made it to stage 3 after spending an eternity at stage 2. According to the slides from the TC39 meeting this month [1][2], it looks like they'll be proposing an advancement to stage 3 in September.

[1]https://github.com/tc39/agendas/blob/main/2022/07.md [2]https://www.dropbox.com/s/g4enjgd4p2npv2s/record_tuple_updat...

This. Roughly a year ago I got interested in efficient immutability for my write-from-scratch-in-C Lisp [0] and started to write a HAMT implementation in C [1], along with a (somewhat hacky, you have been warned) benchmarking suite [2].

The docs are only 70% done (in particular the "putting it all together" part is missing) but it has been a really interesting and enlightening journey so far and can only recommend embarking on this path to everyone.

[0]: https://github.com/mkirchner/stutter [1]: https://github.com/mkirchner/hamt [2]: https://github.com/mkirchner/hamt-bench

See also Matt Bierner's HAMT impl in JavaScript: https://blog.mattbierner.com/hash-array-mapped-tries-in-java...

also Ctrie: "a concurrent thread-safe lock-free implementation of a hash array mapped trie": https://en.wikipedia.org/wiki/Ctrie

So basically like HAMT but lock free.

Isn't this what Sublime Text uses under the hood to give it such good performance?

Isn't this slow because of pointer chasing in the trie?

I do recall Asami using a HAMT and it is written in Clojure. [1]: https://github.com/threatgrid/asami

That’s what Immutable.js used under the hood.

As a case study we evaluated the performance of this along with other languages that use immutable data structures (perf testing) for a recent production app we were building. After some evaluation we had the opinion that the penalty of using immutable HAMT's on the Node/JS platform is just too high, at least in its current lib form. Immutable structures are often somewhat slower, but the relative penalty of using a HAMT on the Node/JS platform vs a straight mutable structure was much higher than other languages. Floats for numbers, conversions back and forth implicitly, pop count intrinsic missing, conversions in and out, and other such overheads make the relative penalty of an immutable HAMT vs a mutable JS map much higher compared to other languages such as .NET or Java let alone C/Rust.

Given Node is mostly single threaded as well some of the advantages (not all!) of immutability diminish on that platform making the performance penalty harder to swallow, at least for my use case.

It of course may not matter to your case, but I find the benefits of these structures come from their quicker clone/copy time in many applications, snapshot-like characteristics, easy large diffs, etc. For this to matter the collection has to be of a decent size, where a linear copy/equals/etc. is too slow to work for the use case. At that point speed starts to matter as the costs of random lookup drift further away from a standard mutable map, and for JS we deemed the penalty too high and decided on another backend platform amenable to writing generic data structures.

Yes, I evaluated it. The complexity of knowing where to convert from/to plain JS, plus the extra library syntax to learn, plus the performance cost of toJS, made it a poor fit for my particular use case. Nearly as much of a hard sell at work as saying “let’s rebuild the UI in Clojurescript,” and without providing as much benefit.

My use case is pretty atypical though, and it’s worth checking out if you have more reliable data shapes/update paths.

Does immer have the same drawbacks as immutable? It uses proxies as opposed to hash array mapped tries.

The key design difference between the two is that immutable.js wants you to keep your stuff in its format (and operate on your stuff with its functions) until you hit some piece of code that needs a plain JS object. Whereas immer is scoped (mostly?) to update functions, and always returns plain JS. Immer seems to be designed to simplify the problem of only changing references to changed objects—and it looks like it does a good job of it.

So with immutable.js, you get more powerful data manipulation throughout the app, at the cost of having to know when and where to convert things back to plain JS. With immer, you get tightly-scoped immutable updates to objects, and inside the update function you can treat them as if they’re mutable, letting you write more idiomatic-looking JS. Instead of spreading 4 layers of objects to update one state flag, immer lets you say:

  const nextState = produce(state, (draft) => {
    draft.view.marketing.annoyingSubscribePopup = false
  })
Every object reference in that path will be updated, but the rest of “state”, “state.view”, etc. will be unchanged.

If you can keep everything inside of immutable.js, it is the fastest thing out there. As soon as you have to drop back to plain JS, it gets slower. See this performance graph: https://immerjs.github.io/immer/performance/

Thanks for reminding me of this. We finally dropped IE11 support, so may be able to get some benefits from introducing immer either by itself or by bringing in Redux Toolkit.

I imagine careless use of such a structure would be an easy way to create a memory leak. Is it possible to create persistent collections in js, that will free data no longer directly referenced?

The careless usage of shallow copies of objects (with JS spreading) presents the same issue. Properly scoping assigned constants and variables is still important.

Persistent collections are available today via immutable.js, so it can be done. The catch is that you have to use a library for it, and transform them back into plain JS when something expects an object or an array. The language itself could make this transparent and lower-cost by implementing it at the engine level.

Persistent collections are primarily useful in functional programming paradigms. JS is a multi-paradigmatic language, so it doesn’t make sense to use them as the default. It would sure be nice to be able to opt into using them in FP-aligned frameworks like the current React ecosystem though.

Relatedly, RRB trees for immutable vectors with good constant factors.

Splay trees:

These are binary search trees, but when you 'splay' the tree (such as by search) it rebalances the tree so that the search result is on top. This means while it is O(log n) for operations like other self-balancing trees, it optimizes tree depth in the face of non-random access so that recently accessed items are fewer steps into the tree.
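The splay operation itself can be sketched as follows (the standard recursive formulation; a full tree would also splay on insert and delete, and the plain BST insert here is just to build a demo tree):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def rotate_right(x):
    y = x.left
    x.left, y.right = y.right, x
    return y

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y

def splay(root, key):
    """Bring `key` (or the last node on its search path) to the root."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:                      # zig-zig
            root.left.left = splay(root.left.left, key)
            root = rotate_right(root)
        elif key > root.left.key:                    # zig-zag
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return root if root.left is None else rotate_right(root)
    else:
        if root.right is None:
            return root
        if key > root.right.key:                     # zig-zig
            root.right.right = splay(root.right.right, key)
            root = rotate_left(root)
        elif key < root.right.key:                   # zig-zag
            root.right.left = splay(root.right.left, key)
            if root.right.left is not None:
                root.right = rotate_right(root.right)
        return root if root.right is None else rotate_left(root)

def bst_insert(root, key):
    # plain BST insert, only for building a demonstration tree
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    else:
        root.right = bst_insert(root.right, key)
    return root

def inorder(n):
    return inorder(n.left) + [n.key] + inorder(n.right) if n else []
```

Even a pathological linked-list-shaped tree gets progressively flattened by searches, which is where the amortized O(log n) bound comes from.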

Piece tables:

A bit more common for text editors, where you need to represent a sequence of characters in a form that allows efficient memory use and fast insertions/removals (so editing the first line isn't moving the 10MB of data that follows it). You create a series of references to spans (or pieces) of the document, possibly starting with a single span pointing to a mmap() version. Edits are done by appending/prepending pieces, which are potentially just references to subsequences of items created in fresh memory, appended into a buffer. Saving can create a single sequence (and a single span).

This has interesting variations:

- Put attributes on the pieces for formatting, such as indicating text should be rendered a different color or bolded.

- Create a hierarchy of pieces-of-pieces. With formatting attributes you are then dangerously close to a DOM.

- Retain old copies of a piece table - since your original mmap() file hasn't changed and your changes are in an append-only buffer, those piece table copies provide undo/redo state.
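A toy piece table along those lines (illustrative sketch: a string stands in for the mmap'd original, and edits only ever append to the add buffer, which is what makes old copies of the piece list valid undo states):

```python
class PieceTable:
    """Original buffer is immutable; all edits append to `added`."""
    def __init__(self, text):
        self.original = text
        self.added = ""
        # each piece is (buffer_name, start, length)
        self.pieces = [("original", 0, len(text))] if text else []

    def text(self):
        bufs = {"original": self.original, "added": self.added}
        return "".join(bufs[b][s:s + ln] for b, s, ln in self.pieces)

    def insert(self, pos, text):
        new_piece = ("added", len(self.added), len(text))
        self.added += text
        offset = 0
        for i, (b, s, ln) in enumerate(self.pieces):
            if pos <= offset + ln:
                k = pos - offset    # split point inside this piece
                repl = [p for p in [(b, s, k), new_piece, (b, s + k, ln - k)]
                        if p[2] > 0]
                self.pieces[i:i + 1] = repl
                return
            offset += ln
        self.pieces.append(new_piece)   # insertion at/past the end
```

Note that inserting into a 10 MB document only splits one piece; the buffers themselves are never moved, and a saved copy of `pieces` is a complete snapshot.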

Came here for splay tree. I’ve found some really powerful applications for this in low level database engine work.

Being able to incrementally rebalance a tree and keep the set of deltas small each time is really powerful when you are dealing with append only IO abstractions.

The VSCode Blog has a great article on Piece Tables talking about their adoption of that data structure https://code.visualstudio.com/blogs/2018/03/23/text-buffer-r...

Spatial hashing.

Say that you have data that is identified with points in 2D or 3D space. The standard way that CS students learn in school to represent this is via a quadtree or octree. However, these tree data structures tend to have a lot of "fluff" (needless allocations and pointer chasing) and need a lot of work to be made efficient.

Spatial hashing is the stupidly simple solution of just rounding coordinates to a grid and storing nearby objects together in a hash table. As long as your data is reasonably well behaved then you get many of the advantages of a quadtree, and it's completely trivial to implement.
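The whole trick fits in a few lines (a 2D sketch; names are mine):

```python
from collections import defaultdict

class SpatialHash:
    """Bucket points by their integer grid cell."""
    def __init__(self, cell_size):
        self.cell = cell_size
        self.grid = defaultdict(list)

    def _key(self, x, y):
        return (int(x // self.cell), int(y // self.cell))

    def insert(self, x, y, obj):
        self.grid[self._key(x, y)].append((x, y, obj))

    def query_near(self, x, y):
        """Everything in the 3x3 block of cells around (x, y):
        a superset of all points within one cell size."""
        cx, cy = self._key(x, y)
        out = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                out.extend(self.grid.get((cx + dx, cy + dy), ()))
        return out
```

The candidates from `query_near` still need an exact distance check, but the hash has already discarded almost everything far away.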

TBH even quadkeys are a fun answer to OPs question, many people aren't aware of them.

Simple explanation:

If you have data with x y coordinates and you know the bounds.

To compute the quad key for a point:

1. The key starts as the empty string. (I've also seen it start with "Z" to handle points outside the bounds)

2. Divide the space into 4 quadrants

3. determine which quadrant the point falls in, append a letter (A-D depending on the quadrant) to the key

4. Repeat steps 2-3 using that quadrant's bounds (i.e. a recursively smaller area) until you have the desired accuracy

This can then be used to efficiently find points within rectangular bounds.
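The steps above can be sketched as follows (the A-D quadrant labelling and argument order are my own choices):

```python
def quadkey(x, y, x0, y0, x1, y1, depth):
    """Quadkey of point (x, y) inside the bounds [x0, x1) x [y0, y1)."""
    key = ""
    for _ in range(depth):
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2   # split the space into 4 quadrants
        if y < my:
            if x < mx:
                key, x1, y1 = key + "A", mx, my     # lower-left
            else:
                key, x0, y1 = key + "B", mx, my     # lower-right
        else:
            if x < mx:
                key, x1, y0 = key + "C", mx, my     # upper-left
            else:
                key, x0, y0 = key + "D", mx, my     # upper-right
    return key

# Nearby points share key prefixes, so a plain sorted index over the keys
# turns rectangular lookups into range scans.
print(quadkey(1, 1, 0, 0, 100, 100, 3), quadkey(99, 99, 0, 0, 100, 100, 3))
```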

A lot of people don't know about the related segment and interval trees. Both efficiently allow you to compute all time ranges that overlap a point. This can be generalized to higher dimensional spaces.

For example if you have a set of calendar invites with a start and stop time (say tens of millions of them) and you want to find all invites which span 12pm (that is start before 12 and end after 12) you want an interval tree.
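A minimal centered interval tree for exactly that stabbing query (an illustrative sketch, not production code): each node stores the intervals crossing its center point twice, sorted by start and by end, so a query only scans matches plus one miss per node.

```python
class IntervalNode:
    """stab(x) returns all (start, end) intervals with start <= x <= end."""

    def __init__(self, intervals):
        xs = sorted([s for s, e in intervals] + [e for s, e in intervals])
        self.center = xs[len(xs) // 2]            # median endpoint
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        here = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        self.by_start = sorted(here)                       # ascending starts
        self.by_end = sorted(here, key=lambda iv: -iv[1])  # descending ends
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def stab(self, x):
        if x < self.center:
            out = []
            for iv in self.by_start:   # only intervals starting at or before x can match
                if iv[0] > x:
                    break
                out.append(iv)
            return out + (self.left.stab(x) if self.left else [])
        if x > self.center:
            out = []
            for iv in self.by_end:     # only intervals ending at or after x can match
                if iv[1] < x:
                    break
                out.append(iv)
            return out + (self.right.stab(x) if self.right else [])
        return list(self.by_start)     # x is exactly the center: all of `here` matches

# Calendar invites as (start, end) hours: which ones span 12pm?
tree = IntervalNode([(9, 10), (11, 13), (10, 14), (12.5, 15)])
print(sorted(tree.stab(12)))
```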


That's how mongodb geospatial indexes work IIRC

Was about to mention this. If I recall correctly, the 2d-sphere index rounds geospatial coordinates to 5 decimals. Very occasionally, I found it would distort polygon geometries just enough to cause them to become invalid (e.g. overlapping geometries), which causes the index build to fail.

In my recent experience working with collections containing millions of documents, each containing a geoJSON-style polygon/multipolygon representing a property (i.e. a block of land), I found invalid geometries occurred for about 1 document in 1 million. For a while, I suspected the data vendor was the cause; however, it became more puzzling when other geospatial software confirmed the geometries were valid. Eventually we traced the issue to the 2d-sphere index.

A very clever workaround was suggested by a colleague of mine, inspired by [1]. It preserved the original geometries. In each document, we added a new field containing the geometry's extent. A 2d-sphere index was then built on the extent field instead of the original geometry field. Invalid geometries were no longer an issue since we were dealing with much simpler geometries that were substantially larger than the max precision of the index.

When running geoIntersects queries on our collection of millions of documents, we did so in 2 steps (aggregation queries):

1. GeoIntersects on the extent field (uses the index).

2. On the result set from the last step, perform geoIntersects on the original geometry field (operates on a much smaller set of records compared to querying the collection directly)

[1] https://www.mongodb.com/docs/manual/tutorial/create-queries-...

Seems exactly like the broad phase and narrow phase in a game physics engine.

The same things get invented over and over again and named different things depending on the field. Sometimes it's not immediately clear that they are the same things mathematically.

It's tried and tested for sure, I first encountered them in 94 but I assume they're much older.

A little sleuthing shows that (TIL) they are an application of Z-order curves, which date back to 1906

Hmm... the quad keys end up looking very much like coordinates. In the regular coordinates, the most significant bit divides the space in half, then the next bit divides those spaces in halves etc...

You could get a single "key" by just having a sequence of bits: most significant bit of the x coordinate, most significant bit of the y coordinate, second most significant bit of x, second most significant bit of y, third...
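That bit-interleaving is exactly a Morton code; a naive sketch (real implementations use bit-twiddling magic instead of a loop):

```python
def morton(x, y, bits=16):
    """Interleave bits: each x bit lands one position above the matching y bit."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i + 1)   # x bits -> odd positions
        key |= ((y >> i) & 1) << (2 * i)       # y bits -> even positions
    return key

# Sorting points by morton() groups spatially nearby points together.
print(morton(0, 0), morton(1, 1), morton(2, 2), morton(3, 3))
```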

If you know basic binary search, this is very intuitive. Your explanation was well written.

It should be noted that this solution completely fails when the data is anything like real world location data which tends to be clustered in cities, stadiums etc. The strength of quad trees is precisely the fact that the "Alaska" quadrant can have a single child for that crazy outdoors-man using your gadget, while New York can have a million children trees down to every house on the street.

I’ve always found Morton encoding to be a cool approach. There’s an old but good blog post about this: https://www.forceflow.be/2013/10/07/morton-encodingdecoding-...

This looks like a more modern implementation: https://crates.io/crates/lindel

I didn't think that would be an obscure data structure, but Locality-sensitive hashing [1] helps in so many cases. Like nearest neighbour search, collision detection, etc.

[1]: https://en.wikipedia.org/wiki/Locality-sensitive_hashing

A neat use of these is the "fast multipole method" for simulating galaxies.

If you want to compute the gravitational force on a given star exactly, you need to figure out the forces from all other stars and add them up. This is O(n^2) and slow. But it turns out you can get a good approximation by dividing up space with a kd-tree and using the average force from each cell to represent the stars underneath it in the tree. This gets you to O(n log n).
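To show just the center-of-mass idea without building a tree (the real method uses the tree so "far" adapts per scale; this flat-grid sketch treats every cell as far, which works here only because the whole cluster is distant from the target star):

```python
import random
from collections import defaultdict

random.seed(0)
target = (0.0, 0.0)                          # the star we want the force on
# A distant, tight cluster of 1000 unit-mass stars around (50, 50).
stars = [(50 + random.gauss(0, 5), 50 + random.gauss(0, 5)) for _ in range(1000)]

def force(on, src, mass=1.0):
    # Inverse-square attraction of `src` (total mass `mass`) on the point `on`.
    dx, dy = src[0] - on[0], src[1] - on[1]
    inv = mass * (dx * dx + dy * dy) ** -1.5
    return dx * inv, dy * inv

# Exact: O(n) per star, hence O(n^2) for one full simulation step.
fx = sum(force(target, s)[0] for s in stars)
fy = sum(force(target, s)[1] for s in stars)

# Approximate: bucket stars into grid cells, then replace each cell by a
# single point at its center of mass, weighted by how many stars it holds.
CELL = 20.0
cells = defaultdict(list)
for s in stars:
    cells[(int(s[0] // CELL), int(s[1] // CELL))].append(s)

ax = ay = 0.0
for pts in cells.values():
    n = len(pts)
    com = (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)
    gx, gy = force(target, com, mass=n)
    ax, ay = ax + gx, ay + gy

print(len(cells), "cells instead of", len(stars), "stars; relative error:",
      abs(ax - fx) / abs(fx))
```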


Ever heard of geospatial hierarchical Indices?

E.g. Uber's H3 index or Google's S2 index.

I have a very optimized octree, but one thing I can not use it for is points in space. Mine is built with the assumption that every object has a finite size, which is used to determine what level of the tree it lives in. Points fail because they have zero size.

I'm still looking for a good spatial index for points. I'll think about yours.

Are these similar to space-filling curves like z-order indices? https://en.wikipedia.org/wiki/Z-order

Some more such curves: https://web.archive.org/web/20220120151929/https://citeseerx...

A variation of this are dynamic spatially-hashed voxel grids, where the outer grid is spatially hashed and the grid elements are stored (as needed) as dense voxel grids. The advantages of this are that you get voxel grid-like behavior over essentially unbounded size, so long as the data is sparse (otherwise you would need to allocate a significant number of grid elements as voxel grids and the memory savings go away).

These data structures have been used in 3D reconstruction methods like CHISEL where they can often outperform octrees due to better memory access behavior.
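A sketch of the two-level layout (the block size and flat bytearray storage are arbitrary choices of mine): a sparse hash of block coordinates on the outside, dense storage allocated lazily on the inside.

```python
BLOCK = 16          # dense blocks of 16^3 voxels each
blocks = {}         # sparse outer hash: block coords -> flat dense array

def set_voxel(x, y, z, v):
    key = (x // BLOCK, y // BLOCK, z // BLOCK)
    if key not in blocks:
        blocks[key] = bytearray(BLOCK ** 3)      # dense grid, allocated on demand
    blocks[key][((x % BLOCK) * BLOCK + y % BLOCK) * BLOCK + (z % BLOCK)] = v

def get_voxel(x, y, z):
    blk = blocks.get((x // BLOCK, y // BLOCK, z // BLOCK))
    if blk is None:
        return 0     # untouched space costs no memory at all
    return blk[((x % BLOCK) * BLOCK + y % BLOCK) * BLOCK + (z % BLOCK)]

# Python's floor division/modulo handle negative coordinates consistently.
set_voxel(100, 3, -5, 7)
print(get_voxel(100, 3, -5), len(blocks))
```

Reads and writes inside a block are plain array indexing, which is where the better memory-access behavior relative to octrees comes from.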

Couple some spatial hashing with Morton codes and fast parallel Radix sorting and some binary search derivative and you can do all sorts of fancy queries for collision detection and graphics stuff.

Yeah, just reinvented it a few weeks ago and was like “whoa, that’s it?” It worked perfectly for my fast clustering use case.

> reasonably well behaved

Curious what this means in this context. No hot spots? Not sparse?

Performance degrades when you get many hash collisions, which will happen if you have a lot of points clumped together in space.

Sparsity is fine.

For points you expect to be unevenly distributed the more advanced spatial partitioning data structures (kd tree, BVH) perform better.
