Memcached 1.5.18 can now recover its cache between restarts

dragonsh · on Sept 19, 2019

It's a great addition to Memcached. I have been using it since the very early years, when setting up Java/PHP-based systems to improve performance.

This is a nice feature. I hope other software can write documentation in satirical form like Memcached. [1]

[1] https://github.com/memcached/memcached/wiki/TutorialCachingS...

ara24 · on Sept 19, 2019

This is great! In my case, the only reason for choosing Redis over Memcached was persisting to disk. With Memcached being multi-threaded and Redis being single-threaded, I see a big win for Memcached in simple use cases.

halukakin · on Sept 19, 2019

The way I understand it, it is not crash safe. It can now persist data in some cases.

However, Redis is crash safe to about 1 second before the crash or so, if not better.

dallbee · on Sept 19, 2019

With fsync_always it's crash safe for all but the most pathological scenarios.

gokhan · on Sept 19, 2019

But its performance is way worse, almost voiding the case for choosing Redis.

RhodesianHunter · on Sept 19, 2019

I have not found this to be true. I suppose it depends on your use case and deployment.

TylerE · on Sept 19, 2019

How? It's persisting to a RAM disk. It only properly cleans up when sent a specific shutdown signal.

dallbee · on Sept 19, 2019

Sorry, fsync_always (`appendfsync always` in redis.conf) is a Redis setting, not a memcached one. The fsync is to a real disk.

The parent poster's comment about redis being crash safe to 1s is just the default.

tus88 · on Sept 19, 2019

I do wonder what happens when a process writes to a memory mapped file, but crashes before the page is synced to disk. Does the write disappear?

antirez · on Sept 19, 2019

If the OS is still alive, everything is ok (but the OS may sync different pages at different times and in any order if you don't fsync specific ranges with special calls). If the whole machine crashes (power issue, kernel panic, ...), what you find on disk can be a mess.
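A minimal sketch of the "special calls" idea, using Python's standard `mmap` module. The function name `write_and_sync` and the 4096-byte size are made up for illustration. On POSIX systems `mmap.flush()` is a wrapper around `msync()`, and without it the kernel is free to write the dirty page back whenever it chooses.

```python
import mmap
import os
import tempfile


def write_and_sync(path: str, payload: bytes, size: int = 4096) -> bytes:
    """Write payload through a shared file mapping, then ask for it to be synced."""
    with open(path, "w+b") as f:
        f.truncate(size)  # the file must be at least as large as the mapping
        with mmap.mmap(f.fileno(), size) as m:
            m[:len(payload)] = payload  # lands in the page cache, not necessarily on disk
            # Explicitly request write-back of the mapping (msync on POSIX).
            m.flush()
    # Reading the file back works with or without flush(): the page cache
    # outlives the process. Only a whole-machine crash can lose an unsynced page.
    with open(path, "rb") as f:
        return f.read(len(payload))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_and_sync(os.path.join(d, "demo.bin"), b"hello"))
```

`flush()` also accepts an offset and a size, which is the "specific ranges" variant mentioned above.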

derefr · on Sept 19, 2019

> what you find on disk can be a mess

Is this "can be" determined by the backing filesystem? E.g., would you expect to get a clean result from a log-structured or copy-on-write filesystem?

alexnewman · on Sept 19, 2019

I prefer to architect around using Redis in an ephemeral way. Redis isn't exactly safe, in the same way that PostgreSQL wasn't until very recently on newer Linux. The semantics of fsync on Linux have been esoteric and poorly understood in the error cases. I would try to cause fsync to fail in another process while memcached is shutting down, then immediately recover. I wonder if the authors checked this scenario. Redis kind of does the right thing and will eventually put the right thing on disk, but why do it?

coded · on Sept 19, 2019

What issue caused postgres to not be safe recently?

alexnewman · on Sept 21, 2019

Fsync doesn't work the way you think it does

https://www.google.com/url?sa=t&source=web&rct=j&url=https:/...

Fsync Errors - PostgreSQL wiki

PostgreSQL's fsync() surprise [LWN.net]

anarazel · on Sept 21, 2019

Note that this was only an issue in cases where the storage system itself was failing (i.e. IO errors were generated). Protection against power failures etc., in contrast, was and is working correctly.

alexnewman · on Sept 25, 2019

Uhm the first sentence is accurate. The second is not

anarazel · on Sept 25, 2019

I'm fairly sure I was accurate. What exactly are you referring to?

chisleu · on Sept 19, 2019

Sharding is a fairly easy path to scale redis writes out across multiple processes/cores.
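A rough sketch of what client-side sharding can look like, assuming four Redis processes on one host. The addresses and the `shard_for` helper are hypothetical, and CRC32 is just one convenient stable hash.

```python
import zlib
from typing import Sequence


def shard_for(key: str, shards: Sequence[str]) -> str:
    """Pick a shard by hashing the key; the same key always maps to the same shard."""
    index = zlib.crc32(key.encode("utf-8")) % len(shards)
    return shards[index]


# Hypothetical addresses of four single-threaded Redis processes, one per core.
SHARDS = ["127.0.0.1:6379", "127.0.0.1:6380", "127.0.0.1:6381", "127.0.0.1:6382"]

if __name__ == "__main__":
    for key in ("user:1", "user:2", "session:abc"):
        print(key, "->", shard_for(key, SHARDS))
```

Plain modulo sharding remaps most keys when the number of shards changes. Consistent hashing is the usual way to limit that.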

rawoke083600 · on Sept 19, 2019

Well done to them for getting the feature out. I've been a long time user of Memcached.

Lol interesting the way we used it at my last place of business was to restart the memcached server every 5 minutes (cron). Because as we know.... there are only really two hard problems in computer science. Naming things and invalidating cache :D

snarf21 · on Sept 19, 2019

You forgot off-by-one errors.

toomuchtodo · on Sept 19, 2019

> Your system clock must be set correctly, or at least it must not move before memcached has restarted. The only way to tell how much time has passed while the binary was dead, is to check the system clock. If the clock jumps forward or backwards it could impact items with a specific TTL.

Why not write the system timestamp to the memory state file as well? And then "evict" (ie just not load) anything exceeding TTL on state file load?

dormando · on Sept 19, 2019

That's exactly what it does, though since it's resuming a monotonic clock the actual code was a bit more complicated...

To find out how long it was down, it notes system time into the state file on shutdown. On start it checks the current system time and adds the delta to the monotonic timer and resumes. Objects exceeding TTL are removed appropriately.
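The scheme described above can be sketched roughly as follows. This is not memcached's actual code: the JSON state file, the function names, and the choice to clamp a backwards clock step to zero are all assumptions made for illustration.

```python
import json
import time
from typing import Dict, Optional


def save_state(path: str, cache_clock: float, wall_now: Optional[float] = None) -> None:
    """On shutdown: record the cache's own clock and the system (wall) time."""
    wall = time.time() if wall_now is None else wall_now
    with open(path, "w") as f:
        json.dump({"cache_clock": cache_clock, "stopped_at": wall}, f)


def resume_clock(path: str, wall_now: Optional[float] = None) -> float:
    """On start: advance the saved cache clock by however long the process was down."""
    wall = time.time() if wall_now is None else wall_now
    with open(path) as f:
        state = json.load(f)
    # Assumption: treat a backwards step of the system clock as zero downtime.
    downtime = max(0.0, wall - state["stopped_at"])
    return state["cache_clock"] + downtime


def drop_expired(items: Dict[str, float], cache_clock: float) -> Dict[str, float]:
    """Keep only items whose expiry (in cache-clock seconds) is still in the future."""
    return {key: exp for key, exp in items.items() if exp > cache_clock}
```

For example, a cache clock saved at 100 s that was down for 60 s of wall time resumes at 160 s, and an item due to expire at 150 s is dropped on load.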

toomuchtodo · on Sept 19, 2019

I appreciate that you took the time to explain this.

xhgdvjky · on Sept 19, 2019

I think they use the system clock to determine the time after the restart

toomuchtodo · on Sept 19, 2019

I might’ve just misunderstood that paragraph. I’d assume with $snapshot_timestamp, current time, and TTLs, time wouldn’t be an issue (and of course, don’t futz with system time while memcached isn’t running).

mlyle · on Sept 19, 2019

Not really. If it's 4:00:00 and the system clock says 4:20:00, and then it exits... then 21 minutes goes by and the system clock is set correctly. It is indistinguishable from the clock not stepping and 1 minute going by.

> (and of course, don’t futz with system time while memcached isn’t running).

Well, yes, that's what the original thing you quoted said: "Your system clock must be set correctly, or at least it must not move before memcached has restarted."
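The 4:00 / 4:20 example above, worked through in a few lines. The date and the helper name `apparent_downtime` are arbitrary. The point is that a restarted process only ever sees the difference between two clock readings.

```python
from datetime import datetime, timedelta


def apparent_downtime(clock_at_exit: datetime, clock_at_start: datetime) -> timedelta:
    """All a restarted process can compute: the gap between two clock readings."""
    return clock_at_start - clock_at_exit


day = datetime(2019, 9, 19)

# Case A: the clock runs 20 minutes fast at exit and is corrected during the downtime.
real_exit = day.replace(hour=4, minute=0)
clock_reading_at_exit = real_exit + timedelta(minutes=20)  # clock says 4:20:00
real_start = real_exit + timedelta(minutes=21)             # 4:21:00, clock now correct
case_a = apparent_downtime(clock_reading_at_exit, real_start)

# Case B: the clock is correct throughout and one real minute passes.
case_b = apparent_downtime(day.replace(hour=4, minute=20), day.replace(hour=4, minute=21))

print(case_a == case_b)  # both are one minute, although 21 real minutes passed in case A
```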

toomuchtodo · on Sept 19, 2019

I’ve never worked in environments where real time and system time had that much drift (due to ntp), but I acknowledge it probably happens out there in distributed systems. Accurate time is important!

dormando · on Sept 19, 2019

Before memcached had a monotonic clock people would end up with immortal objects (underflowed TTL's) because ntp would start after memcached and make a huge adjustment due to the hardware clock being really off.

With the restart code, people could run a kernel upgrade and reboot while the daemon is down... so if this ends up causing a huge clock adjustment you're screwed.

nkozyra · on Sept 19, 2019

And the more granular the time-based caching is within that system, the more likely it is that minor time skew can kill the cache.

I've had distributed systems perform unreliably in the < 30s range even with ntpupdate syncing in place.

ape4 · on Sept 19, 2019

So it's now a database ;)

_m6bh · on Sept 19, 2019

Ha! Funny.

tyingq · on Sept 19, 2019

Cool. On this, though:

"/tmpfs_mount/ must be a ram disk of some sort"

Curious if it really must be, or if that's just recommended for speed.

slovenlyrobot · on Sept 19, 2019

Presumably it works by mapping that file underneath a regular memory allocator, which means extremely small and random IO patterns. I think "must" is appropriate, as a real backing file would likely mean a slowdown of 1-2 orders of magnitude! Compare that to a typical DB system, where the allocation structure always has some kind of large-block IO locality by design.

The alternative to 'must' is 'should', and people ignore that more easily, resulting in bug reports like 'memcached performance abysmal when running with disk', 'why does memcache cause 1MB of IO when I only read 256 4-byte keys', etc

alexgartrell · on Sept 19, 2019

Not every memory access to a memory mapped file is automatically paged out immediately, so depending on vm dirty bytes, etc, you may get away with it for a while.

But the worst case will be dismal and not unlikely.

dormando · on Sept 19, 2019

You can do it at really low load levels, but you'll lose performance consistency. mmap'ing files over a real filesystem of any kind is super complicated.

The access pattern isn't optimized at all for flash or HDD or etc... however it does work super well if that mount happens to be a DAX mount over persistent memory.

pas · on Sept 19, 2019

Can't you just pass in a raw block device, get rid of any fsyncs, and you've got a very volatile backing store? The OS will write things out as best it can, and if the whole thing fits in memory (and it's a memcached instance, so it should after all), then reads and writes will be fast. And if you want to restart, just stop the memcached process, do a sync, wait for it ... and reboot. (This seems simpler than copying the tmpfs, but a lot less safe/deterministic?)

dormando · on Sept 19, 2019

You can try, but it's not going to be very consistent. You're at the mercy of whenever the OS decides to start flushing pages.. head of line blocking, mmap_sem locks, file descriptor locks, etc. My old job had an mmap-on-disk storage engine at scale and it wasn't any fun.

I think the worst part is mmap-on-disk looks fine at first, and only comes out as a problem after you scale up a while. False sense of security. :/

archon810 · on Sept 19, 2019

The ramdisk requirement means the data is still lost between reboots, right?

joecot · on Sept 19, 2019

Keep reading. You can use persistent memory for the ramdisk so it'll be maintained between system reboots https://memcached.org/blog/persistent-memory/

jacobush · on Sept 19, 2019

Ah, the difference between RAM: and RAD: on the Amiga. Nice. :)

dormando · on Sept 19, 2019

Yup, but you can copy that file (+ the .meta file) to disk, reboot, copy back, etc. It'll take a long time for a large cache unless your disk is fast.
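A sketch of that copy step, assuming memcached has already been shut down cleanly and that the companion file is named by appending `.meta` to the cache file's name. The helper name `copy_cache` and any paths passed to it are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def copy_cache(cache_file: str, dest_dir: str) -> List[str]:
    """Copy a cache file and its .meta companion into dest_dir; return the new paths."""
    src = Path(cache_file)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # Assumption: the metadata file sits next to the cache file as "<name>.meta".
    for path in (src, src.with_name(src.name + ".meta")):
        copied.append(str(shutil.copy2(path, dest / path.name)))
    return copied
```

The same function can be run in the other direction after the reboot, copying from the disk directory back into the freshly mounted tmpfs before memcached is started.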

If you have some kind of pmem device and a dax mount, those can survive reboots on their own.

lubesGordi · on Sept 19, 2019

It's an interesting choice, writing to a mmap file. I wonder how the performance is impacted as compared to just mandating the write be replicated to a nearby memcached instance (like you can do with Redis).

dormando · on Sept 19, 2019

In this case, if you're using tmpfs like we recommend the performance is identical. It's the same as if we used normal shared memory (or so far as I can tell via benchmarks).

Even with an optane pmem mount the perf. is super close.

exabrial · on Sept 19, 2019

Seems like that should be a 1.6.x feature if they're following semver.

netik · on Sept 19, 2019

The important thing here is that this fixes the cold cache problem, for good.

netik · on Sept 19, 2019

I love memcached a lot. It took too long to get here, and this wasn’t the primary use case of memcached.