Hacker News new | past | comments | ask | show | jobs | submit login
Ristretto: A High Performance, Concurrent, Memory-Bound Go Cache (dgraph.io)
215 points by ngaut 24 days ago | hide | past | web | favorite | 55 comments

This is a seriously great write up. So many links to research and papers. There were some truly new-to-me concepts and they implemented some stuff that are optimizations that I've not had to deal with yet despite the scale with which I work; it is fun to read about those. Seriously awesome work -- I would love to optimize as much as they have. Two things stand out. One, they could maybe improve thrashing at full capacity by evicting more than one key when at capacity (say, evict n% of total keys). Two, my reading here suggests this is not production ready and dangerous to use.

> ... then used this hash as not only the stored key but also to figure out the shard the key should go into. This does introduce a chance of key collision, that’s something we plan to deal with later.

If a Get and Set can produce a collision and it is not dealt with, then it is unsafe for production. Imagine calling user foo's data and returning user bar's data instead. This could leak PII and would be an unnoticed bug. Generally, in collisions, you drop down to a list and hurt your runtime to ensure accuracy. I need a correct cache over a fast cache.

Unless I misunderstood, this is not production ready. Again, amazing work; it is just not done yet.

> by evicting more than one key when at capacity (say, evict n% of total keys)

Ristretto does that already. It evicts as many keys as needed to stay within the MaxCost. This is done outside the critical path by background goroutines. However, you also don't want to evict too many keys, because that would affect the hit ratios.

Very cool!

> If a Get and Set can produce a collision and it is not dealt with, then it is unsafe for production.

That was confusing to me as well in the article. However they also wrote:

> If however, a key is already present in the cache, Set would update the key immediately. This is to avoid a cached key holding a stale value.

So it sounds like inserts to the cache might not immediately apply, but overwrites are guaranteed to invalidate stale data in cache.

The problem isn't stale data, the problem is getting data for an entirely different key. As an extreme example, if your terrible hashing algo was 'int(key) mod 10', then you have no collisions on keys "3", "45", and "99". However, if you asked for "19", you would get back the data for "99". If that data is their email address, name, address, or medical prognosis, then you just leaked PII when that data is served to a user. Worst of all, you wouldn't even know unless the customer calls up wondering why they got incorrect data.

We will fix that in the near future. Known problem with our design choice. Wanted to ensure we put it out there first, before solving it. FWIW, note that the probability of a collision with 64 bit hash is pretty low. Still, needs to be solved.

I use 128 bit spooky hash for keys in a system with ~1M keys in cache at any given time, give or take.

Rough numbers but maybe 75% of those keys existed last month and will still exist next month and probably next year, too, and 25% will exist for minutes to hours and likely never be seen again.

Collisions are uncommon, but a regular enough occurrence (dozen or two a year).

Could this be solved by some sort of cuckoo hashing?

The benefit of current hashing algorithm is that Go team is maintaining it ;-). Its written in assembly and runs in single digit nanoseconds.

One way to check for collision would be to either store the key as well or use another hashing algorithm (which would further decrease the probability of collision, but not eliminate it). Either would introduce some slowness, because every Get would need to be checked.

Storing the key with the value is a solid workaround.

Wow. That is a great write-up. I was admittedly skeptical of "writing your own cache" but these guys are super-OCD and obviously capable of producing best-in-class implementations. Kudos.

...kind of wondering, they had so at least three "well, go doesn't do ..." (lock-less data structures, thread local storage, and channels aren't high perf) rabbit holes that I wonder if go was really the right implementation language for their project (iirc they've also had articles go by about writing best-in-class go versions of rocksdb/etc.).

But, they do seem very capable of rolling their own infra, and I suppose someone has to do it to bootstrap ecosystems in each new/upcoming language.

Thank you for the kind words!

You're right about our rabbit holes. Our initial "why" for this project was essentially: "Caffeine is written in Java with excellent performance. Go should be (and usually is) competitive with Java. Why is there no Go equivalent?"

With such a high bar we were disappointed in naive, idiomatic Go solutions, but we knew there had to be some way to close the gaps.

I look forward to continuing to do so. We have a few things on the roadmap such as adaptivity (see Ben Manes' work for more info) and managing our own memory.

Thanks! If not Go, we would have used C++. But the simplicity and readability of the Go codebase is a huge win. Also, C++ is just not as much fun to write in.

I joke that Go is kind of like the wild West. It comes with its Benefits and own unique challenges.

TinyLFU is kind of magical--most of the win you'd get from expensive tracking of key frequencies, in much less space. The bloom filter in front seems like a clever hack.

Not an expectation/criticism/etc. of Ristretto, but out of curiosity, does anybody know of work about making the cache aware of the expected cost of a cache miss for each key?

To spell out what I mean: if I'm sending a query result to the cache, I could also pass along how long the query took to run. Then, in theory, the cache could try to drop keys with low (expected number of fetches * expected cost per miss), not just low (expected number of fetches). Of course that's rough/imperfect, and for some caching scenarios the miss cost doesn't vary by key.

I guess mostly the question is if it's enough of a help to justify the metadata+complexity, which I guess requires a trace of gets/sets from something that has miss costs that vary.

The intuition is just that now that we're pretty good at optimizing hit rate, maybe expanding the problem to "minimize total miss cost" could be an interesting direction for the use cases where there's a difference.

There is not a lot of research in this area and I briefly outlined my understanding of the current state when responding to a github issue:


Oh, fascinating. My intuition re: size was that you'd look at worth per byte for something like query results, a whole different kind of thing from the CDN use case that's been studied.

Makes me wish I could justify adding instrumentation to our 'memoize' wrappers at work, although our caching doesn't really match the use case for something like Ristretto or Caffeine. (Little enough stuff and the latency of a networked cache is fine for us, so "just set aside lots of RAM on some cache servers" works well enough.)

That would indeed be an interesting research area.

Good work! Caching is one of the 2 hardest problems in Computer science. Not because building a cache is hard, but cache invalidation/replacement is complex when you consider the system as a whole. Integrating the cache and preventing user from seeing the stale value is not a trivial task (if you want to support highly concurrent access). MySQL removed query caching from 8.0 because of the mess that query caches are. Enforcing transaction isolation levels in a database with caching is non-trivial.

Exactly. Doing caching in a transactional MVCC system is hard. We plan to do careful data caching in Dgraph with this. Block level cache in Badger would be easy. So would be key with version level cache in Badger.

Disclaimer: I admit to only reading the blog post and not looking at the code yet.

I really enjoyed this write-up as well! But I couldn't tell from the post: does it have a distributed mode like groupcache? (I think the answer is no). I think nearly all decisions in groupcache are because of the word "group" :).

Similarly, it would be great if there's a "safe mode" instead of the (IIUC) "start dropping Set requests when overloaded". You'd naturally want most applications to be prepared for your cache to get blown away, so you should be prepared for it to be lossy, but I'd certainly want a knob for that :).

By using a cache with an eviction policy, you sort of already accept that the data you write at "t" might not exist at "t+1". So that solves the semantics and expectations part.

For the part where an overloaded cache drops Set requests, the most frequent Set requests are the one most likely to be admitted in between the dropped ones, so I guess this is also sort of taken care of.

I have no affiliation with the project, but reading and asking myself the same question as you, I reasoned what I wrote above.

Glad you liked it!

> does it have a distributed mode like groupcache?

Not at the moment, no. Though it could, theoretically, be used within groupcache. We'll look into that as it has been brought up a few times now.

> it would be great if there's a "safe mode" instead of the (IIUC) "start dropping Set requests when overloaded".

We could definitely look into adding a config flag for that. I'd love to know: in what situations would lossless Sets be important to you? We already have guaranteed updates, so the only Sets that are dropped are new items.

Also, if we guarantee that every Set is accepted, it goes against the idea that the Estimate of the incoming key must be higher than the Estimate of the key getting evicted.

Of course, one way to ensure every key goes in is to have a cache of infinite MaxCost. But, that’s an orthogonal idea, I guess.

Ahh! I missed that only new items are dropped, while updates are always accepted. That avoids the “umm, I have stale data” issue. That was my main worry!

Here is the code for groupcache:


There is no "distributed mode" as far as I can see. It is an in memory library for a K/V cache that you use in your code. That's it.

The description is somewhat clumsy (I’ll have Bradfitz update it) but “peers” allows for distributed. There are lots of things I don’t like about groupcache (e.g., immutability / need to version) but it is where the “group” part comes from.

Can't believe I missed that! Thanks for clarifying that.

Kudos on the informative blogs. It's always better when engineering teams share their knowledge ;)

In setting up the cache, can I assign values to be arbitrary length, binary safe strings? Can this be used to store indexed blob data of large size? Not saying this is a good idea btw...

Values can be arbitrary length, yes. We plan to store SSTable blocks from Badger in Ristretto. They can be strings, byte slices, whatever. That's the reason we chose interface{} as the input.

Excellent! thnx so much for building

Does anyone know of a similar cache that lets you restrict memory usage by “customer” for example? I used redis scripting to implement this use case in redis, it really feels like a hack though.

Why are you sharing caches between customers?

Seems like you would just instantiate separate caches per customer if that's what you're looking for.

Caches generally sit above the auth system. Sharing them between customers is a catastrophe waiting to happen.

We spend a lot of time here in postmortems asking what they hell were they thinking. This is a great opportunity to ask pre-catastrophe.

> we used the natural randomness provided by Go map iteration to pick a sample of keys and loop over them to find a key

The iteration order of Go maps is randomized, but it's far from being evenly distributed (it could be good enough for the Ristretto use case though). Here is a test that shows that some keys are 6 times more likely to be selected on the first iteration https://play.golang.org/p/dT1CWuqoHEM.

Based on this behavior, the two scenarios for Ristretto would be:

1. If the chosen sample keys have high Estimate, the incoming keys could be rejected, affecting the Hit ratios. However, our Hit ratio benchmarks didn't show such behavior (within 1% of exact LFU).

> With this approach, the hit ratios are within 1% of the exact LFU policies for a variety of workloads.

2. The chosen keys have low Estimates, in which case they'd be removed and hence, won't suffer from repeated selection.

So, yeah. Doesn't affect Ristretto. But, good thing to keep in mind for other workloads.

This is really nice. The sync.Pool based ring buffer is a really clever way to get around the thing that always kills my low latency work in golang.

I wonder how much the type casting costs you’ll see in real world situations? I’ve not benchmarked that generally but it’s always something I worry about with ‘generic’ data structures like this.

Thank you. The sync.Pool lossy ring buffers are probably my favorite innovation, especially because of how little code it is.

I was worried about type casting as well and generally avoided bare interfaces like the plague. However, in all the benchmarks I've ran for throughput, type casting was a small percentage of CPU time. I'll see if I can dig up a flamegraph.

Of course, the only reason sync.Pool performs so well is because of its internal usage of thread local storage. If the Go team exposed that, we could make our own typed sync.Pool implementations... I won't hold my breath, though.

Question about the buffered writes and invalidation: is there anything that prevents race conditions between a Del() and a prior Set() on the same key? Just glancing at the source, it looks Set() ends up in a channel send, whereas Del() seems to synchronously update the internal map.

I am not familiar with the Go code, but in the Java version (Caffeine) this is handled by a per-entry state machine. The entry is written to the Map first and the policy work buffered for replaying. If the replay occurs out-of-order (e.g. context switch before appending causing removal=>add), then the entry's state guards against policy corruption. This may not be necessary in their case due to using a sampling policy and other differences.

Feel free to file an issue. That could cause correctness issue in Ristretto as of today.

This issue would fixed with this PR: https://github.com/dgraph-io/ristretto/pull/62

Great write up and tool, can't wait to try it out in my projects. Enjoyed your last write up as well!

Go is usually used for microservices, and for that scenario, Ristretto is not applicable since it's a library, and not a "shared server" like Redis. Am I right?

Very much to the contrary, why go over the network to get a value from a key-value store when you can have a mutex protected map that only hods the values that where accessed by that node. The "lets have a shared memory between our services" architecture makes things easy only to make it much harder again.

I’ve been trying to convince my coworkers that we should have a cache per node, but so far they aren’t biting.

Alternatively I think within the decade we are going to expect every service to contain its own fault prevention logic instead of just the web tier. At which point every service has its own reverse caching proxy in front of it.

Microservices are long lived and having an in-memory cache can alleviate network calls and improve response time, depending on the kind of work being done. A shared cache like redis requires a network call obviously. One of our caching libs does just that: check memory, then check redis, then do the expensive data collection from source.

If you mean will it persist across restarts, then no. If you mean does it have a built in network protocol, then no.

But neither of those are requirements for micro services. I certainly will try this out the next time I need a cache in my microservice.

One way to make Ristretto act as a server would be to bake it into Group cache. Which is the equivalent of memcached in Go.

Wow, I was just researching go caches like a week ago and read your previous summary of the state of the world. Didn't realize this had been posted. Excellent timing.

Oh, same name as the Xfce image viewer: <https://docs.xfce.org/apps/ristretto/start>

Name collisions are unfortunate but inevitable. This name makes me think of ristretto255 (https://ristretto.group/), but that’s the world I live in. I suspect this project is likely to see more name recognition over time than either of our examples.

indeed, I thought this was a post about the crypto stuff :c

As we know, two most complex problems on software development are caching and choosing right names. Well, here we have an example of combination of both)

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact