AFAIK, he wrote the original memcached in Perl, but what everyone is using now is a rewrite of memcached in C by Anatoly Vorobey [1]. Not that it diminishes his efforts or anything, just pointing it out.
And then I rewrote it in C++ for App Engine and other users within Google (Blogger, Plus, hundreds others).
I also then ported the C++ memcached (called memcacheg) to Go, for profiling Go vs C++, and named that one memcachego. So I've written it about 4 times.
"I've written it 4 times, and I've gotten exceedingly good at it" :)
Kidding aside, did you learn anything during the rewrites that caused noticeable changes to the design, or is it still the same thing, just transliterated?
And if stuff doesn't get written in Go, then people will criticise Go for not having anything in production. The internet is not going to run out of space through people writing too much software.
groupcache has a couple of differences from memcached that are probably part of why it exists:
- thundering herds suck, so it manages how the cache is filled when a key is missing (see the sketch just after this list)
- repeatedly retrieving super-hot values will eventually exhaust the network pipe of whichever server holds them, so groupcache replicates hot values to more than one cache server
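The cache-fill management in the first point is the classic "single flight" trick; groupcache ships it as a small singleflight package, and the core idea fits in a few lines. A stripped-down sketch, simplified from the real thing:

```go
package singleflight

import "sync"

type call struct {
	wg  sync.WaitGroup
	val []byte
	err error
}

// Group deduplicates concurrent fetches of the same key.
type Group struct {
	mu sync.Mutex
	m  map[string]*call
}

// Do runs fn once per key no matter how many goroutines ask at once;
// latecomers block and share the first caller's result.
func (g *Group) Do(key string, fn func() ([]byte, error)) ([]byte, error) {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // a fill is already in flight; wait for it
		return c.val, c.err
	}
	c := new(call)
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()
	c.wg.Done()

	g.mu.Lock()
	delete(g.m, key)
	g.mu.Unlock()

	return c.val, c.err
}
```

So a thousand simultaneous misses on one key turn into one trip to the backend instead of a thousand.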
Since this was built for dl.google.com, the 'value' might be big (say, a chunk of the Chrome binary) and the request rates for the top cache keys probably get pretty high. So, even if memcached is doing good work for lots of folks, the distinct features of groupcache probably address real needs in the app.
And, yeah, groupcache and the dl.google.com rewrite are Go partly because Brad likes Go, but both projects had reasons to exist quite apart from deploying Go for Go's sake.
I'd be curious to hear performance numbers (assuming a reasonable front-end server to this library). I get it that the replicated in-memory caching part is valuable. But (from painful experiences with Java) I also fear that a GC based memory management system is anti-optimal for an in-memory cache of small objects, especially as the size of heap grows to beyond a couple of GB( * ).
Roughly: a GC has to look at the pointers in use to see what objects are still reachable. But it doesn't have to look at the contents of strings, byte slices, int arrays, etc., because, by definition, those don't contain pointers.
For dl.google.com, the first user of groupcache, it sounds like the Go runtime was working with relatively few, relatively large chunks of memory (think 2MB chunks of a Chrome binary). Since that doesn't have many pointer-containing objects, just big chunks of inert data that don't need to be scanned, it's easyish. On the other hand, something like throwaway2424's case could be a GC stress test--many gigs of RAM holding a network of small, pointer-filled objects.
Really, the tl;dr may be to prototype/measure if you need to know how a GC will do. Assumptions are easy; data is hard.
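If you do want a starting point for that kind of measurement, here's a toy program in the spirit of the above (a sketch, not a rigorous benchmark; the numbers will vary wildly by Go version and machine): fill the heap with inert bytes, then with pointer-rich nodes of similar total size, and compare the pauses the runtime reports.

```go
package main

import (
	"fmt"
	"runtime"
)

// node is pointer-rich: the GC must chase .next on every cycle.
type node struct {
	next *node
	pad  [56]byte // pad to ~64 bytes so both heaps are comparable
}

// lastPause forces a collection and returns its pause time.
func lastPause() uint64 {
	runtime.GC()
	var s runtime.MemStats
	runtime.ReadMemStats(&s)
	return s.PauseNs[(s.NumGC+255)%256] // most recent entry in the ring
}

func main() {
	// ~1 GB of inert data: one object, no interior pointers to scan.
	inert := make([]byte, 1<<30)
	fmt.Println("pause with big []byte:  ", lastPause(), "ns")

	// ~1 GB as 16M small linked nodes: each one gets scanned.
	var head *node
	for i := 0; i < (1<<30)/64; i++ {
		head = &node{next: head}
	}
	fmt.Println("pause with linked nodes:", lastPause(), "ns")

	runtime.KeepAlive(inert)
	runtime.KeepAlive(head)
}
```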
Ahh, got it. Also, the reason Brad specifically mentions big arrays of bytes/strings is that they don't look like pointers. So... I'm guessing you still want to stay away from huge arrays of ints (or whatever type has the same size as a pointer on your platform). Right?
And as for measuring... it's not cheap (developer-time-wise) to build a realistic enough model of your application and put realistic enough loads on it for long periods of time to test each of your hypotheses (and GC theories do require long periods of load). So you have to narrow the space of choices, informed by past experience and sometimes by gathering semi-reliable folklore. I'm currently engaging in the latter activity :-)
+1 to ablankst: byte isn't special here. Brad singled out []byte/string because those are the types stored in groupcache (see ByteView in the groupcache code for his sneaky tricks to make them look the same to other code).
And yes, unfair to ask someone to test every guess they have. But if you do want to know a bit more about what effect GC and Go would have, experimenting with toy programs isn't a bad way to get a feel.
In Go, and other type-safe languages, the compiler usually knows exactly which ints are being used as pointers. This is the difference between "precise" and "conservative" GC and is one of the reasons that comparing GC performance in C++ to GC performance in other languages is difficult.
I was under the impression (again, folklore, mailing lists, etc.) that Go had a conservative collector. I can't find an official documentation link that'll tell me whether it is or not at the moment.
Atom also says 1.0's was more conservative, but, as Brad also said, still didn't scan "objects such as []byte" (meaning all plain-old-data arrays? who knows). The Go 1.1 Release Notes mention the collector becoming more precise, which was a particular issue on 32-bit because big heaps could span a lot of the address space.
At some point, this sort of discussion probably gets you less useful info per unit effort than just playing with a Go distribution, trying out whatever toy programs you find interesting.
Go's GC isn't nearly as powerful as any of the many GC implementations for the JVM, but Go's approach is also quite different. In Go, you can reduce or avoid the garbage that you create. And when the GC turns out to be the limiting factor of your cache, then you can also mmap some dedicated pages from the OS and manage the memory on your own. But I guess that won't be necessary. Do you have any issues with Go's GC today?
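To make the mmap option concrete: it's just the standard syscall package. A Unix-only sketch; memory obtained this way lives outside the Go heap, so the collector never scans or frees it:

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	const size = 1 << 30 // 1 GiB, illustrative
	mem, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(mem)

	mem[0] = 42 // a plain []byte you manage yourself; the GC never sees it
	fmt.Println(mem[0])
}
```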
Well, I don't have issues with Go's GC today, because my horrific experiences with various JVMs have been a powerful deterrent against any future experiments in performance-critical, large-heap server processes with a GC-controlled heap. Other than that, I really, really like the design of Go. This is in addition to my faith in the solid team and company backing the language.
I'm quite interested in the two mitigations you mentioned, especially the "manage memory on your own" option. I was not aware that there was a Go-language-blessed (that is, not a C extension) way to do such a thing.
As for "reduce/avoid" creating garbage, how can I achieve that for a large heap caching application (like the one we're talking about here) ?
In any GC'd language, the less you allocate the less you GC, so "recycling" old objects when you need new ones can help. So where you'd normally make() a []byte, you can instead get one from a list or buffered channel you made of old discarded []bytes. So that gets you back to malloc/free/new/delete, but you can at least do it only for the most performance-relevant allocations in your program and keep being lazy everywhere else.
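A minimal version of that buffered-channel free list (illustrative names):

```go
package bufpool

// bufFree holds discarded buffers for reuse; its capacity bounds how
// many idle buffers we keep alive.
var bufFree = make(chan []byte, 64)

// getBuf reuses an old buffer if one is available, else allocates.
func getBuf() []byte {
	select {
	case b := <-bufFree:
		return b[:0] // keep the capacity, reset the length
	default:
		return make([]byte, 0, 4096)
	}
}

// putBuf returns a buffer to the free list, or drops it if full.
func putBuf(b []byte) {
	select {
	case bufFree <- b:
	default: // list is full; let the GC reclaim this one
	}
}
```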
Another sometimes-useful way to save allocations--not for a cache, but in general--is just to turn short-lived allocations into longer-lived ones: if fooBuf is used and thrown away by each of several calls to obj.baz(), then make a single fooBuf and store it as a (private) field on obj, assuming that doesn't present thread-safety or other problems in the specific context.
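In code, using the fooBuf/baz names from above (a hypothetical type, with a lock standing in for whatever handles the thread-safety caveat):

```go
package scratch

import "sync"

// obj keeps one reusable buffer instead of allocating a fresh fooBuf
// on every call to baz.
type obj struct {
	mu     sync.Mutex
	fooBuf []byte // guarded by mu; reused across calls
}

func (o *obj) baz(p []byte) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.fooBuf = append(o.fooBuf[:0], p...) // reuses last call's capacity
	// ... do the actual work with o.fooBuf ...
}
```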
There are certainly things you wouldn't use Go for, but now you know more about reducing memory pressure in GC'd languages. :)
Well, various techniques to recycle/reuse allocations instead of "garbaging" them are well-known techniques from the Java world too. Also, the generational collectors popular in JVMs do a good job with lots of short-lived allocations (as long as the vast majority of them die young).
The problem is, none of those things help when you have a large cache of small(ish) objects that need to be scanned fully at every GC cycle. There's nothing worse for your performance than stopping the world for a few seconds and using that time to wipe your L1/2/3 caches with completely useless data. That's the reason I called large-heaped caches the anti-pattern for a GC'd system.
Brad's suggestion of using large byte-array allocations (and, presumably, managing the smaller chunks manually) is a fair enough answer to this question. But of course in this case you'll be writing code to manage small allocations out of that big buffer yourself, thus negating the whole point of GC. But it might be a decent enough compromise if you like the rest of the language a lot (which Brad clearly does :-)
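For what it's worth, that big-buffer approach is essentially a hand-rolled arena. A bare-bones sketch, leaving out freeing and compaction (which is where the real work, and the "negating the point of GC" part, lives):

```go
package arena

import "errors"

// Arena hands out offsets into one large, pointer-free allocation.
// To the GC it is a single opaque object, no matter how many values
// are packed inside.
type Arena struct {
	buf  []byte
	next int
}

func New(size int) *Arena { return &Arena{buf: make([]byte, size)} }

var ErrFull = errors.New("arena: out of space")

// Alloc reserves n bytes and returns their offset into the buffer.
func (a *Arena) Alloc(n int) (int, error) {
	if a.next+n > len(a.buf) {
		return 0, ErrFull
	}
	off := a.next
	a.next += n
	return off, nil
}

// Bytes returns the slice for a previous allocation.
func (a *Arena) Bytes(off, n int) []byte { return a.buf[off : off+n] }
```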
The easiest way would be to store everything you can on the stack. The Go compiler does a very good job at escape analysis and tries to put everything on the stack that does not escape. Therefore, it's often better to use output parameters instead of return values to reduce (or eliminate) the number of allocations.
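A small illustration of the output-parameter point (hypothetical functions):

```go
package escape

// Double allocates on every call: the result slice is created here
// and escapes to the caller, so it lands on the heap.
func Double(xs []int) []int {
	out := make([]int, len(xs))
	for i, x := range xs {
		out[i] = 2 * x
	}
	return out
}

// DoubleInto writes into a caller-supplied slice instead; a caller
// that reuses dst across calls pays for the allocation once.
func DoubleInto(dst, xs []int) {
	for i, x := range xs {
		dst[i] = 2 * x
	}
}
```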
Some careful thoughts about the memory layout of your structs, especially which of them should be embedded and / or passed around by pointers and which of them shouldn't, might also pay off.
Another common optimization is to put allocated objects back into a memory pool for later reuse. Take a look at the bufCache channel [1] from the bufio package, for example (the http and json packages use the same trick).
Go's garbage collector sucks pretty hard for large heaps. A 16GB heap that's mostly live objects will take tens of seconds per round, during which no goroutines will run (it's a stop-the-world collector). Basically it's 15 years behind Java in this regard.
I couldn't find in the code where the objects are actually stored, but theoretically there's nothing stopping you in Go from allocating yourself up an array of X bytes (for X in the mega or gigabytes) and completely managing it yourself, leaving the GC merely for incidental activities, which themselves can be minimized through various design patterns and some care.
It is true that you should describe Go as a garbage-collected language, but unlike a lot of other such languages, Go has C-like mutable arrays readily available. In practice it's more like a sort of hybrid, in that even though it's garbage collected you have a lot of opportunities to write code that still doesn't really use it, without having to "drop down to" C or something. I still hope to see some improvements in it before I could make a big commitment to it, but I've certainly got my eye on it.
>I couldn't find in the code where the objects are actually stored, but theoretically there's nothing stopping you in Go from allocating yourself up an array of X bytes (for X in the mega or gigabytes) and completely managing it yourself,
Nothing stops you, except:
- The determination NOT to do something the computer can do
- Previous exposure to a proper, modern GC
- The fact that you used Go to get away from this kind of shit in the first place
I did not mean this is a technique you'd use in general. I meant specifically in the context of groupcache, where an array of bytes that you manage yourself is a reasonable approach to the problem.
If you don't want to do that, don't. I'd never start the prototype with that functionality, for instance. But when you discover that you need it, Go permits it in a way that Python or Perl do not. Go is GC'ed and you are free to use that to the extent you want, but it's easier to escape from the GC than it is in many GC'ed languages. It may not be obvious from a casual reading of the spec, but there's a lot of ways around garbage in Go. Not like, say, Rust, and certainly not the same level of Raw Unstoppable Power as C++ (at a corresponding huge complexity cost), but it's a very interesting middle ground.
Note that if you were writing groupcache in Java, you'd probably end up doing the same thing and just allocating an expanse of bytes which you manage yourself.
I've often heard people say things that imply "proper" GCs don't behave like whatever badness I'm experiencing currently. Can you give me an example of a garbage collector that I could safely use for a large heap of small cached (i.e., long-lifetime) objects without a significant throughput penalty, and with deterministic latencies (i.e., no random full-system pauses), compared to, say, a tcmalloc-based hand-allocated/freed system?
> A 16GB heap that's mostly live objects will take tens of seconds per round
The situation you describe, as you describe it, is no longer (was it ever?) what actually happens. I run stuff with 12-20GB of active objects and a few dozen HTTP hits a second, and the only times I get a latency over 100ms are when I reload/refresh some data from disk (1.5GB of gob); then it goes to 150-250ms typically.
> Basically it's 15 years behind Java in this regard

This is fanboyism or just ignorance.
It does not matter how sophisticated the GC code is when you have millions of tiny objects from crappy Java code with useless copying and referencing. This is why it uses so much memory.
It is very difficult to find anything worse than Java in terms of wasting memory. Take any Hadoop node and measure the ratio between the memory used by the Java processes and the amount of actual data stored in memory. Near a factor of 2?
>It does not matter how sophisticated GC code is when you have millions of tiny objects from from crappy Java code with useless copying and referencing. This is why it uses so much memory.
Or, you know, you can avoid writing "crappy" code. Whereas with Go's GC, even proper code will have problems.
You are arguing something that is beyond discussion: Java's GC is better than Go's.
> Or, you know, you can avoid writing "crappy" code.

In Java? Hardly. It requires discipline and skills beyond the capacity of a typical developer. This phenomenon could be called "memory is cheap and the JVM will handle everything".
Even in Clojure code, for example, they implement a node of a binary tree as a hash table, which is a meaningless waste. Position-based selectors would be good enough.
> Java's GC is better than Go's

Yes, in theory, on paper. In practice, well-written Go code could outperform typical Java code, even with a less sophisticated GC.
Of course, I cannot prove that, but there are some intuitions to support such claims.
I really wouldn't call this an alternative. If you are running memcached, it's very unlikely you can switch to Groupcache.
Parts of your application may rely on the expiration feature. But the biggest change is the inability to overwrite an existing cache key. Every application I've used does this constantly (object updates).
Groupcache in its current form is useful for a very narrow set of applications.
In Rails 4 with Russian doll caching, when the data or the view changes, the key changes, so rather than expiring keys, you just don't use them again. So you have no real need for expiry or overwrites: when data changes, the key changes with it (I think they use the object's modified date plus a hash of the view template). As far as I can see, groupcache would be compatible with that caching scheme, or others like it.
The only bit which could be tricky is if you want to clear the cache, I don't see a way of doing that from a quick look at the API. But to clear the cache you could have a special version key hashed into each key whose sole purpose is to empty the cache, which you bump whenever you want to remove all keys at once for whatever reason (in a perfect world, this shouldn't be necessary of course).
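That special version key could be as simple as a generation counter folded into every key (hypothetical helper):

```go
package cachekeys

import (
	"fmt"
	"sync/atomic"
)

// generation is folded into every cache key. Bumping it moves all
// lookups to fresh keys; the old entries simply age out of the LRU.
var generation int64

// Key prefixes a cache key with the current generation.
func Key(key string) string {
	return fmt.Sprintf("g%d:%s", atomic.LoadInt64(&generation), key)
}

// ClearAll "clears" the cache by abandoning every existing key.
func ClearAll() { atomic.AddInt64(&generation, 1) }
```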
The intention with Russian doll caching is that the cache uses LRU, so you don't need to worry about clearing the cache. No poison pill needed. Memcache operates this way.
Given that there's an 'lru' folder with a package that says 'implements a LRU cache' as the first comment, I assume this works that way as well.
Yes in a perfect world you never need to clear everything, but sometimes it's useful for testing or to allow end users to clear the cache in case of caching bugs.
I would think that the real limiting factor right now is that it's only available to Go applications (it's a Go library, not a standalone server). There are comparatively few Go applications in the wild that could use this... whereas a memcached server can interface with just about anything. Or am I missing something?
Groupcache being a library is what made me so excited about it in the first place.
I'm writing a torrent tracker in Go that could make great use of a distributed cache. This would enable you to put an arbitrary number of "tracker nodes" behind a load-balancer.
One of the goals of my project is that it's easy to configure: everything except the RDBMS is hosted inside a single binary and configured with a single file. In the simplest case, that binary is responsible for two web servers, a process-local cache, and a DB connection pool. (In more complex cases, each binary simply acts as another node behind a load-balancer, and they attempt to cooperate.)
It's useful to avoid hitting the disk by having the tracker cache torrents that are active. When you start adding more tracker nodes, this is where having a distributed cache would be nice.
Without groupcache, that means I've just forced my users to install and configure memcached. With groupcache, the configuration and server are simply part of my application. (As an added bonus, since Go is statically linked, I haven't even added any external dependencies on shared libraries.)
---
I agree that the use-cases for groupcache are far more limited in scope, but I was still glad to hear that this was a library, and not a standalone server. I think we need _more_ distributed caches implemented as libraries, not fewer.
There are many use-cases where it may not be preferable to have a separate caching server. In my career: getting approval to install memcached on a customer's server could add _weeks_ to our deploy time simply because of red-tape; in my personal programming endeavors: setting up memcached is just an added layer of complexity that could easily be avoided by using libraries instead of servers.
I'd argue that we should leave memcached [more specifically: standalone distributed caches] for the more complicated scenarios where a dedicated solution is called for. Bring on the distributed-cache libraries where a much simpler, ad-hoc solution would be more beneficial to end-users.
So your key is key + transaction ID, where the transaction ID is simply a value that changes when the data changes. You search the cache with the current value of the ID, and if the data changed the cache entry will be regenerated.
An example of a good transaction ID would be SHA1(data).
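As a sketch (hypothetical helper):

```go
package cachekeys

import (
	"crypto/sha1"
	"encoding/hex"
)

// VersionedKey embeds a fingerprint of the data in the key, so an
// update produces a brand-new key instead of an overwrite.
func VersionedKey(id string, data []byte) string {
	sum := sha1.Sum(data)
	return id + ":" + hex.EncodeToString(sum[:])
}
```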
Values that must expire could perhaps use a truncated time value as part of the key? It's not as flexible as regular expiration, because the entries would only expire when you cross the point in time where the truncated time value changes, which could be a problem in a lot of applications (sudden surges of requests every x seconds)
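Something like this (hypothetical helper; note that every entry in a bucket "expires" at the same instant, hence the surge concern):

```go
package cachekeys

import (
	"fmt"
	"time"
)

// TimedKey buckets time by ttl. When the bucket number rolls over,
// every lookup moves to a new key and old entries age out of the LRU.
func TimedKey(id string, ttl time.Duration) string {
	bucket := time.Now().UnixNano() / int64(ttl)
	return fmt.Sprintf("%s:t%d", id, bucket)
}
```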
I want to use this, but since the keys are immutable, how can I store data like sessions which can change and would sometimes have to be invalidated from the server side (i.e. you can't simply change the session ID in the cookie and use a new cache entry, because bad-guy could still be holding on to an old stolen session ID)?
In general, how can one learn to think in an immutable fashion to effectively exploit this?
If the client has an old link, serve the old version. But have links to what will be future versions and make the client walk them. It is doable, but different. Lots of stuff is static anyway, like CSS, so you can use this approach for it and have a different process for stuff that varies.
But as the grandparent said, serving the old version may result in a security vulnerability. There are cases where you MUST serve the latest version, and the latest version only.
The general approach is to store initial values and then changesets to those objects.
For example git has loads of immutable data (every commit, tree and blob is immutable), and only very little mutable data (the refs) that point to some of those immutable objects.
You simply wouldn't use groupcache for session data; you'd use memcached.
As others said, groupcache isn't useful for session storage. memcache was used for a lot of things other than caching, because it was a very versatile hammer and a lot of things looked like nails. groupcache is for when your data's immutable and bandwidth used for the most popular keys might exhaust a single memcache server's pipe.
This isn't for sessions unless they're also backed up somewhere else. It's just for caching. You would need some other system where you look up which version of a session is current, then look up sessions in the cache by a unique version id.
Am I reading this as a distributed, immutable, weak hash table rather than what one would consider a 'cache'?
Mind you, doing so avoids the hardest parts of caching (and especially distributed caching, which otherwise begins to underperform somewhere around 5-7 nodes or more), so I can see significant upside. No surprise stales, distribution update clogging, etc.
I've been thinking about something similar. I don't see how timed expiration would conflict with the two most important features - the filling mechanism and the replication of hot items. Am I missing something that would make timed expiration impossible?
Yeah, on a first pass at the problem, you seem right.
The CAS must have an authoritative node (my mind wanders thinking about replication and failover), but the key it protects - with the version baked in - can surely be replicated?
CAS is incompatible with the distribution architecture, which uses a best-effort distributed lock in lieu of e.g. a strongly consistent distributed state machine. It would require a lot more work.
If you have a bug in another application on the server running the cache that causes it to grow its memory use, your cache would suddenly disappear/underperform, and the failure could cascade on to the system that the cache is in front of. If instead you let the offending program crash because the cache is using regular memory, this would not happen. Just a thought.
If you hit swap, again only the offending application or instance is punished, not everyone else (for instance by pummeling a backend database server that other services are using as well)
If program A hits swap, it means that cold pages are written to swap so that A can get those pages; this initial writing is done by program A, it's true. But A may not be the cause of the problem; A is just the straw that breaks the camel's back.
And those pages that got written to swap likely belong to others, and they pay the cost when they need those pages back...
In my practical experience, when one of my apps hits swap, the whole system becomes distressed. It is not isolated to the 'offender'.
You can of course avoid swap, but with your OS doing overcommit on memory allocations, you are just inviting a completely different way of failing and that too is hard to manage. You end up having to know a lot about your deployment environment and ring-fence memory between components and manage their budgets. If you want to have both app code and cache on the same node - and that's a central tenet of groupcache - then you have to make sure everything is under-dimensioned because the needs of one cannot steal from the other; your cache isn't adaptive.
That's why I built a system to do caching centrally at the OS level.
I hope someone like Brad is browsing here and can make some kind of piercing observation I've missed.
That's rather common. If swapping can harm your application, then don't swap. On a machine where slowdown is tolerable (temporarily, on a desktop), swap is fine. On a machine whose entire purpose is to serve as a fast cache in front of slow storage, swapoff and fall back to shedding or queuing requests at the frontend.
That is my experience as well. In my thought experiment the 'offender' would be a server instance, not a process running among other applications on a single machine. Applications that hit swap often have memory leaks, and hitting swap is then just a matter of time. Creating a cascading failure may be preventable however.
It's a configurable limit. 64MB is just what he set the limit to for that particular group, that's what the "64<<20" in the NewGroup call is for. You could configure it to use many gigabytes.
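For reference, this is essentially the example from the groupcache README (generateThumbnail is the README's placeholder for whatever produces the value):

```go
package main

import "github.com/golang/groupcache"

// generateThumbnail stands in for whatever produces the value.
func generateThumbnail(key string) []byte { return nil /* ... */ }

// The second argument caps this group's cache at 64 MB (64<<20 bytes);
// pass e.g. 8<<30 for 8 GB.
var thumbNails = groupcache.NewGroup("thumbnail", 64<<20, groupcache.GetterFunc(
	func(ctx groupcache.Context, key string, dest groupcache.Sink) error {
		return dest.SetBytes(generateThumbnail(key))
	}))
```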
The fact that it doesn't have cache expiration times or inc/dec rules it out as a memcached replacement, at least for me. Well, that and the fact that only Go is supported at the moment.
How does its sharding-by-key algorithm work, and how does it handle adding new peers? I was looking for that in the source, but couldn't find anything related to it.
I'm a bit confused about how you use this system with immutable keys. At face value it's a great idea, but I need a simple example of how it's used to, say, retrieve a piece of data, and then later update it to a new value.
Is this anything like how vector clocks are used, where the client uses the clocks to figure out which is the right state in a distributed system?
The use-case for which this was originally designed was a download server: the values were chunks of files, each of which gets its own unique identifier. If you want to update a file, you simply create a new file for your new version, and point clients at it instead of the old version.
I like the idea, but it seems like it would make deployment a pain. How do I spin up a new server without rebalancing and/or restarting the world? Not to mention that now when I do need to restart the world, I can't do so without also clearing my cache.
* it often doesn't recover from network partitions. Symptoms include both sides of the partition never merging, and deadlocks. Recovering requires rebooting the entire cluster at once. Due to their locking strategy, a rolling reboot of the cluster isn't enough.
* some of the developers didn't seem to demonstrate an understanding of races or distributed systems. When I reported racing bugs, they asked if I could attach reliable unit tests.
Values stored are immutable, and expire when they are no longer used. There is no concept of a TTL.
> How do you set the maximum size of the process's cache?
The cache is broken into a set of named groups, each with their own separate cache and cache-filling method. The size of each group's cache is set when created via NewGroup[1].