
Efficient cache for gigabytes of data written in Go - alexellisuk
https://github.com/allegro/bigcache
======
merlincorey
> BigCache does not handle collisions. When a new item is inserted and its
> hash collides with a previously stored item, the new item overwrites the
> previously stored value.

I'm not sure what hashing algorithm it is using, but this seems like a pretty
undesirable property.

The code for the hash function is found here:
[https://github.com/allegro/bigcache/blob/f64abe8f4fe2f5769bd...](https://github.com/allegro/bigcache/blob/f64abe8f4fe2f5769bd9bef5a2944a83ea673e7e/fnv.go#L19-L28)

    
    
        // Sum64 gets the string and returns its uint64 hash value.
        func (f fnv64a) Sum64(key string) uint64 {
            var hash uint64 = offset64
            for i := 0; i < len(key); i++ {
                hash ^= uint64(key[i])
                hash *= prime64
            }
            return hash
        }
    

The referenced `offset64` is `14695981039346656037` and the `prime64` is
`1099511628211`.

I can find a reference to a similarly named `Sum64` function in
[https://godoc.org/blainsmith.com/go/seahash#Sum64](https://godoc.org/blainsmith.com/go/seahash#Sum64)
which indicates SeaHash is a non-cryptographic hash function and further
considers `Sum64` to be a checksum function.

I'm guessing there are a lot more collisions possible here than one might
otherwise expect.

~~~
meritt
It's using FNV-1a [1] which is indeed a non-cryptographic hash function and is
a perfectly suitable choice for a hash table [2]. The goals are speed and low
chance of collision, not security.

[1]
[https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo...](https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function#FNV-1a_hash)

[2]
[https://softwareengineering.stackexchange.com/questions/4955...](https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed)

~~~
merlincorey
Thank you for identifying the specific hash function - that explains why it
was in `fnv.go`!

My point is more to do with the fact that BigCache does not handle collisions
at all, whereas the built-in Go `map` almost certainly does.

Essentially, I can imagine that a Jepsen-like test would reveal that BigCache
loses some percentage of writes (dependent on the data, of course).

Whether or not that is a problem depends entirely on your use case, of course.

~~~
meritt
While caches and maps share many of the same underlying properties, they serve
very different purposes. Caches generally come with configurable size limits,
eviction policies, and collision handling (overwriting is a simple solution).
They are in no way designed to be durable data storage mechanisms.

~~~
kbaker
So does it just serve completely incorrect data when a request for the 'old'
cache key comes in?

Seems like this would always be undesirable behavior for a cache?

It is definitely being advertised as a map-like structure, key goes in, bytes
come out...

~~~
archgoon
Haven't looked at the full implementation of BigCache, but typically, a cache
would fetch the object and key from the data structure. If the full key
doesn't match the requested key, then the cache will respond that the key is
not in the cache. The calling function would then figure out whether to
replenish the cache or not.

Update:

Yes, that appears to be the strategy in shard.go.

    
    
        if entryKey := readKeyFromEntry(wrappedEntry); key != entryKey {
            s.lock.RUnlock()
            s.collision()
            if s.isVerbose {
                s.logger.Printf("Collision detected. Both %q and %q have the same hash %x", key, entryKey, hashedKey)
            }
            return nil, ErrEntryNotFound
        }
    

So there is no risk of returning the wrong data for a key.
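The pattern in that snippet (store the full key next to the value, re-check it on read, report a miss on mismatch) can be sketched as a toy cache. Everything here (`cache`, `entry`, the constant hash) is hypothetical illustration, not BigCache's real types:

```go
package main

import "fmt"

// entry stores the full key alongside the value so a hash
// collision can be detected at read time.
type entry struct {
	key   string
	value []byte
}

// cache indexes entries by hash; Get re-checks the stored key
// before returning anything.
type cache struct {
	entries map[uint64]entry
	hash    func(string) uint64
}

func (c *cache) Set(key string, value []byte) {
	// On collision the newer entry simply overwrites the older one.
	c.entries[c.hash(key)] = entry{key: key, value: value}
}

func (c *cache) Get(key string) ([]byte, bool) {
	e, ok := c.entries[c.hash(key)]
	if !ok || e.key != key {
		// Either a genuine miss, or a collision: the stored key
		// belongs to some other entry, so report "not found"
		// rather than returning the wrong data.
		return nil, false
	}
	return e.value, true
}

func main() {
	// A deliberately terrible hash (constant) forces every key to collide.
	c := &cache{entries: map[uint64]entry{}, hash: func(string) uint64 { return 42 }}
	c.Set("a", []byte("alpha"))
	c.Set("b", []byte("beta")) // collides with and overwrites "a"

	_, ok := c.Get("a")
	fmt.Println(ok) // false: the collided entry reads as a miss, not wrong data
	v, _ := c.Get("b")
	fmt.Println(string(v)) // beta
}
```

So a collision can lose a previously cached value (the overwrite the README describes), but a lookup never returns data for the wrong key.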

~~~
kbaker
Ah OK, that's good. I was just reading too much into this section of the README:

> BigCache does not handle collisions. When a new item is inserted and its
> hash collides with a previously stored item, the new item overwrites the
> previously stored value.

------
cyri
Have a look at [https://github.com/dgraph-io/ristretto](https://github.com/dgraph-io/ristretto),
also written in Go, which outperforms bigcache and freecache (see the
colourful graphs)

~~~
SergeAx
Alas, Ristretto is only eventually consistent.

~~~
mrjn
Feel free to file an issue if that's a concern -- if enough people feel the
same way we could provide an option for "synchronous consistency". It wasn't
needed in the use cases we looked at.

~~~
SergeAx
It would be a significant performance tradeoff, if I understand correctly. So
there's no point; there are enough ACID in-memory databases in the world already.

~~~
mrjn
Most likely not a performance tradeoff. For every Get and Set, we do lookups
anyway. Caffeine, for comparison, always sets the key-value, and then if the
key doesn't get admitted due to policy, deletes it later.

------
reinhardt1053
Blog post that describes the implementation:
[https://allegro.tech/2016/03/writing-fast-cache-service-in-g...](https://allegro.tech/2016/03/writing-fast-cache-service-in-go.html)

------
shepardrtc
This is very cool. But why didn't they just go with Redis? Their blog stated
that Redis didn't meet the requirement to "be very fast even with millions of
entries", but I have a very hard time believing that.

~~~
KirinDave
Because Redis is... problematic?

1. Redis has a lot of functionality that a simple cache client doesn't need.

2. Redis's connection model can lead to complications.

3. Redis's to-disk checkpointing in practice uses a lot of memory.

4. Redis's poorly chosen default settings have cost the industry an uncounted
but large sum of money.

5. Redis is written in C. That's a bad idea for a networked application.

6. Redis's creator is a person who doesn't deserve our support. He's
constantly combative with experts who have given him good advice about how to
improve Redis, because he has a vision of "simplicity" which translates to
"what I already understand."

Redis is a decent choice if you need all of its features. It's got a wide
spectrum. But, if you don't need ALL of them, then pick a simpler and better
designed system.

~~~
sagichmal
> Redis's creator is a person who doesn't deserve our support.

There are valid criticisms of Redis but this is shitty and vindictive. You
should be ashamed.

~~~
tptacek
I don't love the way the previous comment is written, but it's substantive.
This is just drama; it distills the worst possible reading from the parent
comment and tries to fix that meaning for the rest of the thread, which is
exactly the opposite of what the guidelines ask you to do.

Further: the Redis take on display in that comment is pretty mainstream – very
much including the statement about Sanfilippo's obstinacy – among systems
developers. Even if you're an advocate for Redis, it's good to at least see
the brief its detractors bring against it.

~~~
sagichmal
I agree that Sanfilippo can be obstinate. "Redis' creator doesn't deserve our
support" is a shitty and vindictive conclusion to come to from that
obstinacy.

If that's ^^ unsubstantive drama to you, and the thing it's responding to
isn't, calibrate your sensors because they're off.

~~~
KirinDave
I don't really understand why this is so traumatic to say. There are lots of
people doing amazing things. Maybe we can look at projects run by people who
don't disregard good technical advice for years out of what they themselves
have called pride.

I knew my post would be controversial, but I certainly wasn't expecting that
the main complaint leveled would be that I'm supposed to ignore his and his
community's prior transgressions because remembering them is "vindictive."

------
todd3834
My understanding is that global variables in Go are never garbage collected.
Is that true? If so, doesn't this create the cache struct as a global var,
meaning this cache will never be garbage collected?
[https://github.com/allegro/bigcache/blob/master/server/serve...](https://github.com/allegro/bigcache/blob/master/server/server.go)

I ask this because I’ve run into a problem with this in the past.

~~~
hu3
The struct pointer itself won't be GC'd. That's intentional, as the cache
should be up for as long as the server process runs.

However, that struct contains data that can and will be GC'd once no longer
used. This is the type of the variable:
[https://github.com/allegro/bigcache/blob/master/bigcache.go](https://github.com/allegro/bigcache/blob/master/bigcache.go)

~~~
lainga
The mantra I'm always told around slices is "Don't keep pointer slices. Don't
keep pointer slices." Because the entire thing will fail escape analysis and
go onto the heap. But it's OK in this case, right? Because, as you say, the
BigCache (including the []*cacheShard) is allocated once; never gets GC'd; and
it's around forever.
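The pointer-free layout being alluded to can be sketched as a flat byte arena: the GC sees one big `[]byte` plus a map whose keys and values contain no pointers, instead of millions of individual heap objects it must traverse. This `arena` type is a deliberately simplified illustration (no keys, no eviction), not BigCache's actual shard code:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// arena stores variable-length entries back to back in one big
// []byte, each prefixed with a 4-byte little-endian length. The
// index maps a key hash to an offset into buf. Neither structure
// contains Go pointers, so the GC never scans individual entries.
type arena struct {
	buf   []byte
	index map[uint64]uint32 // hash -> offset into buf
}

func (a *arena) set(hash uint64, value []byte) {
	offset := uint32(len(a.buf))
	var hdr [4]byte
	binary.LittleEndian.PutUint32(hdr[:], uint32(len(value)))
	a.buf = append(a.buf, hdr[:]...)
	a.buf = append(a.buf, value...) // the value is copied into the arena
	a.index[hash] = offset
}

func (a *arena) get(hash uint64) ([]byte, bool) {
	offset, ok := a.index[hash]
	if !ok {
		return nil, false
	}
	n := binary.LittleEndian.Uint32(a.buf[offset:])
	start := offset + 4
	return a.buf[start : start+n], true
}

func main() {
	a := &arena{index: map[uint64]uint32{}}
	a.set(1, []byte("hello"))
	a.set(2, []byte("world"))
	v, _ := a.get(1)
	fmt.Println(string(v)) // hello
}
```

Compare that with a `[]*cacheEntry`: there the GC must chase every element pointer on every scan, which is exactly the "don't keep pointer slices" mantra.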

------
gigatexal
None of these projects or exercises are futile: something is learned in the
process, the language gets pushed into new areas, algorithms get proved out,
etc. But it seems that Go is becoming the next hammer, and every problem in
the problem set has become a nail. Personally I would have just stuck with
Redis

------
correct_horse
All I can think of when I hear about Go now is garbage collector ballast, a
hack for which Go has no proper solution; see
[https://blog.twitch.tv/en/2019/04/10/go-memory-ballast-how-i...](https://blog.twitch.tv/en/2019/04/10/go-memory-ballast-how-i-learnt-to-stop-worrying-and-love-the-heap-26c2462549a2/) and
[https://github.com/golang/go/issues/23044](https://github.com/golang/go/issues/23044)

~~~
awestroke
That's insane. My impression of Go keeps getting worse.

------
deckarep
I have not used this package, but there is another one for Go called freecache
that offers a similar solution. The primary benefit of such solutions is that
if you have a Go service where you want to rely on local caching of data,
using a package like this can _significantly_ reduce GC scan pressure,
removing the work that Go needs to do when analyzing pointer data sitting on
the heap.

Why does this help? Instead of keeping millions upon millions of items alive
on the heap (leading to Go having to scan all such data), you can instead
serialize/deserialize the data in a solution like this, usually with minimal
overhead. Storing your cached data in a solution like this gets rid of the
need to have live pointers to data on the heap, because your data is now
stored as a []byte slice somewhere in the cache data structure that this code
uses. Finally, since packages like BigCache/freecache are built with only a
small handful of heap objects, your program can go back to what it does best,
which is spending most of its time in your application logic.

If anyone has any doubts about this approach, try it out: we saw dramatic
differences using a package like this vs a naive map of pointer-based data,
or vs something like HashiCorp's LRU data structure.

The last service we applied this model to went from running at around 900%
CPU to 400%. That was a big win in my book and practically cut our cluster
size in half.

~~~
gfs
> Storing your cached data in a solution like this suddenly gets rid of the
> need to have live pointers of data on the heap.

So if I understand this correctly, Go does not do any recursive scanning of
structures? Each unique bit of data owns its data for as long as it needs to?

~~~
deckarep
Not quite. The difference here is that instead of storing live objects on the
heap, you serialize them into a byte slice. You then hand that to the cache
library, and it finds a place to copy those bytes and store them on your
behalf. This means your object now just becomes data belonging to the cache,
sitting somewhere in the library's internal data structure.

