Memcached 1.5.18 can now recover its cache between restarts (github.com/memcached)
245 points by archon810 on Sept 18, 2019 | 46 comments

It's a great addition to Memcached. I've been using it since the very early years, when setting up Java/PHP-based systems to improve performance.

This is a nice feature. I hope other software can write documentation in satirical form like Memcached. [1]

[1] https://github.com/memcached/memcached/wiki/TutorialCachingS...

This is great! In my case, the only reason for choosing Redis over Memcached was persisting to disk. With Memcached being multi-threaded compared to Redis being single-threaded, I see a big win in simple use cases for Memcached.

The way I understand it, it is not crash safe. It can now persist data in some cases.

However, Redis is crash safe to about 1 second before the crash or so, if not even better.

With appendfsync always it's crash safe for all but the most pathological scenarios.

But its performance is way worse, almost voiding the case for choosing Redis.

I have not found this to be true. I suppose it depends on your use case and deployment.

How? It's persisting to a RAM disk. It only properly cleans up when sent a specific shutdown signal.

Sorry, that fsync setting (appendfsync always) is a Redis setting, not memcached. The fsync is to a real disk.

The parent poster's comment about Redis being crash safe to 1s is just the default.

I do wonder what happens when a process writes to a memory mapped file, but crashes before the page is synced to disk. Does the write disappear?

If the OS is still alive, everything is ok (but the OS may sync different pages at different times and in any order if you don't fsync specific ranges with special calls). If the whole machine crashes (power issue, kernel panic, ...), what you find on disk can be a mess.
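
To make "fsync specific ranges with special calls" concrete, here is a minimal Python sketch, not anything taken from memcached. `write_page` and `make_file` are hypothetical helper names. It writes into one page of a memory-mapped file and then flushes only that page, which on Unix goes through msync().

```python
import mmap
import os
import tempfile

PAGE = mmap.ALLOCATIONGRANULARITY  # flush offsets must be a multiple of this


def write_page(path: str, page_index: int, payload: bytes) -> None:
    """Write payload at the start of one page of a mapped file, then flush that page."""
    if len(payload) > PAGE:
        raise ValueError("payload must fit in a single page")
    with open(path, "r+b") as f:
        # Length 0 maps the whole (non-empty) file.
        with mmap.mmap(f.fileno(), 0) as m:
            start = page_index * PAGE
            m[start:start + len(payload)] = payload
            # On Unix, flush(offset, size) issues msync() for just this range.
            # Without it, the kernel writes dirty pages back whenever it chooses.
            m.flush(start, PAGE)


def make_file(num_pages: int) -> str:
    """Create a zero-filled temporary file of num_pages pages (hypothetical helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.truncate(num_pages * PAGE)
    return path
```

If the process dies after the write but before the flush, the data still sits in the kernel's page cache and reaches the file eventually. Only a whole-machine crash in that window can lose it.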

> what you find on disk can be a mess

Is this "can be" determined by the backing filesystem? E.g., would you expect to get a clean result from a log-structured or copy-on-write filesystem?

I prefer to architect things to use Redis in an ephemeral way. Redis isn't exactly safe, in the same way PostgreSQL wasn't until very recently on newer Linux. The semantics of fsync on Linux have been esoteric and poorly understood in the error cases. I would try to cause fsync to fail in another process while memcached is shutting down, and then immediately recover. I wonder if the authors checked this scenario. Redis kind of does the right thing and will eventually put the right thing on disk, but why do it?

What issue caused Postgres to not be safe until recently?

Fsync doesn't work the way you think it does


Fsync Errors - PostgreSQL wiki

PostgreSQL's fsync() surprise [LWN.net]

Note that this was only an issue in cases where the storage system itself was failing (i.e. IO errors were generated). That is in contrast to protection against power failures etc., which was and is working correctly.

Uhm, the first sentence is accurate. The second is not.

I'm fairly sure I was accurate. What exactly are you referring to?

Sharding is a fairly easy path to scale redis writes out across multiple processes/cores.
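
A minimal sketch of what client-side sharding can look like, assuming plain modulo hashing over a fixed number of Redis processes. This is not Redis Cluster's hash-slot scheme, and `shard_for` is a hypothetical helper name.

```python
import zlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using CRC32 modulo the shard count."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # CRC32 is stable across runs and machines, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_shards


def group_by_shard(keys, num_shards: int) -> dict:
    """Group keys by the shard (process) that should receive their writes."""
    groups = {}
    for key in keys:
        groups.setdefault(shard_for(key, num_shards), []).append(key)
    return groups
```

Each shard index would correspond to one Redis process, so writes spread across cores. The cost of simple modulo hashing is that changing the shard count remaps most keys.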

Well done to them for getting the feature out. I've been a long time user of Memcached.

Lol, interesting. The way we used it at my last place of business was to restart the memcached server every 5 minutes (cron). Because as we know... there are only really two hard problems in computer science: naming things and invalidating caches :D

You forgot off-by-one errors.

> Your system clock must be set correctly, or at least it must not move before memcached has restarted. The only way to tell how much time has passed while the binary was dead, is to check the system clock. If the clock jumps forward or backwards it could impact items with a specific TTL.

Why not write the system timestamp to the memory state file as well? And then "evict" (ie just not load) anything exceeding TTL on state file load?

That's exactly what it does, though since it's resuming a monotonic clock the actual code was a bit more complicated...

To find out how long it was down, it notes system time into the state file on shutdown. On start it checks the current system time and adds the delta to the monotonic timer and resumes. Objects exceeding TTL are removed appropriately.
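
The description above can be sketched roughly as follows. This is an illustration in Python, not memcached's actual C code. `SavedState`, `resume_timer`, and `live_items` are hypothetical names, and clamping a negative delta to zero is my own assumption.

```python
from dataclasses import dataclass


@dataclass
class SavedState:
    wall_at_shutdown: float  # system (wall-clock) time noted at shutdown
    mono_at_shutdown: float  # the cache's own monotonic timer at shutdown


def resume_timer(state: SavedState, wall_now: float) -> float:
    """Return the monotonic timer value to resume from after a restart."""
    downtime = wall_now - state.wall_at_shutdown
    # Assumption: if the system clock moved backwards, treat downtime as zero.
    downtime = max(0.0, downtime)
    return state.mono_at_shutdown + downtime


def live_items(expiries: dict, timer_now: float) -> dict:
    """Keep only items whose expiry (in monotonic-timer units) is still in the future."""
    return {key: exp for key, exp in expiries.items() if exp > timer_now}
```

For example, a state saved at wall time 1000 with the timer at 50, resumed at wall time 1090, continues from 140. An item expiring at 100 is dropped and one expiring at 200 survives.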

I appreciate that you took the time to explain this.

I think they use the system clock to determine the time after the restart

I might’ve just misunderstood that paragraph. I’d assume with $snapshot_timestamp, current time, and TTLs, time wouldn’t be an issue (and of course, don’t futz with system time while memcached isn’t running).

Not really. If it's 4:00:00 and the system clock says 4:20:00, and then it exits... then 21 minutes goes by and the system clock is set correctly. It is indistinguishable from the clock not stepping and 1 minute going by.
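
Working through that arithmetic as a small Python sketch (the helper names are hypothetical): both scenarios leave the same two wall-clock readings, so the computed downtime is identical even though the real downtime differs by 20 minutes.

```python
def hms(hours: int, minutes: int, seconds: int) -> int:
    """Convert a time of day to seconds since midnight."""
    return hours * 3600 + minutes * 60 + seconds


def observed_downtime(wall_at_shutdown: int, wall_at_start: int) -> int:
    """Downtime as seen from the system clock alone."""
    return wall_at_start - wall_at_shutdown


# Scenario A: real time is 4:00:00 but the clock reads 4:20:00 at shutdown.
# 21 real minutes pass and the clock is corrected, so it reads 4:21:00 at start.
scenario_a = observed_downtime(hms(4, 20, 0), hms(4, 21, 0))
real_downtime_a = 21 * 60

# Scenario B: the clock is correct, reads 4:20:00 at shutdown,
# and 1 real minute passes before start at 4:21:00.
scenario_b = observed_downtime(hms(4, 20, 0), hms(4, 21, 0))
real_downtime_b = 1 * 60
```

Both `scenario_a` and `scenario_b` come out to 60 seconds, while the real downtimes are 1260 and 60 seconds. In scenario A, items would live 20 minutes longer than their TTL allows.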

> (and of course, don’t futz with system time while memcached isn’t running).

Well, yes, that's what the original thing you quoted said: "Your system clock must be set correctly, or at least it must not move before memcached has restarted."

I’ve never worked in environments where real time and system time had that much drift (due to ntp), but I acknowledge it probably happens out there in distributed systems. Accurate time is important!

Before memcached had a monotonic clock people would end up with immortal objects (underflowed TTL's) because ntp would start after memcached and make a huge adjustment due to the hardware clock being really off.

With the restart code, people could run a kernel upgrade and reboot while the daemon is down... so if this ends up causing a huge clock adjustment you're screwed.

And the more granular the time-based caching is within that system, the more likely it is that minor time skew can kill the cache.

I've had distributed systems perform unreliably in the < 30s range even with ntpupdate syncing in place.

So it's now a database ;)

Ha! Funny.

Cool. On this, though:

"/tmpfs_mount/ must be a ram disk of some sort"

Curious if it really must be, or if that's just recommended for speed.

Presumably it works by mapping that file underneath a regular memory allocator, which means extremely small and random IO patterns. I think "must" is appropriate, as a real backing file would likely mean a slowdown of 1-2 orders of magnitude! Compare to a typical DB system, where the allocation structure always has some kind of large-block IO locality by design.

The alternative to 'must' is 'should', and people ignore that more easily, resulting in bug reports like 'memcached performance abysmal when running with disk', 'why does memcache cause 1MB of IO when I only read 256 4-byte keys', etc.

Not every memory access to a memory mapped file is automatically paged out immediately, so depending on vm dirty bytes, etc, you may get away with it for a while.

But the worst case will be dismal and not unlikely.

You can do it at really low load levels, but you'll lose performance consistency. mmap'ing files over a real filesystem of any kind is super complicated.

The access pattern isn't optimized at all for flash or HDD or etc... however it does work super well if that mount happens to be a DAX mount over persistent memory.

Can't you just pass in a raw block device, get rid of any fsyncs and you got some very volatile backing store? The OS will write things out as best as it can, and if the whole thing fits in memory (and it's a memcached instance, so it should after all), then read-write will be fast. And if you want to restart, just stop the memcached process, do a sync, wait for it ... and reboot. (This seems simpler than copying the tmpfs, but a lot less safe/deterministic?)

You can try, but it's not going to be very consistent. You're at the mercy of whenever the OS decides to start flushing pages: head-of-line blocking, mmap_sem locks, file descriptor locks, etc. My old job had an mmap-on-disk storage engine at scale and it wasn't any fun.

I think the worst part is mmap-on-disk looks fine at first, and only comes out as a problem after you scale up a while. False sense of security. :/

The ramdisk requirement means the data is still lost between reboots, right?

Keep reading. You can use persistent memory for the ramdisk so it'll be maintained between system reboots: https://memcached.org/blog/persistent-memory/

Ah, the difference between RAM: and RAD: on the Amiga. Nice. :)

Yup, but you can copy that file (+ the .meta file) to disk, reboot, copy back, etc. It'll take a long time for a large cache unless your disk is fast.
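
A rough Python sketch of that copy step, assuming memcached has already been shut down cleanly and that the state file sits next to the memory file with `.meta` appended to its name. `copy_cache_files` and all paths are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def copy_cache_files(cache_file: Path, dest_dir: Path) -> List[Path]:
    """Copy the memory file and its .meta companion into dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the meta file is "<memory file name>.meta" in the same directory.
    meta_file = cache_file.with_name(cache_file.name + ".meta")
    copied = []
    for src in (cache_file, meta_file):
        dst = dest_dir / src.name
        shutil.copy2(src, dst)  # copy2 also preserves timestamps and permissions
        copied.append(dst)
    return copied
```

The same function works in both directions: from the tmpfs mount to disk before the reboot, then from disk back to the tmpfs mount before starting memcached again.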

If you have some kind of pmem device and a dax mount, those can survive reboots on their own.

It's an interesting choice, writing to a mmap file. I wonder how the performance is impacted as compared to just mandating the write be replicated to a nearby memcached instance (like you can do with Redis).

In this case, if you're using tmpfs like we recommend the performance is identical. It's the same as if we used normal shared memory (or so far as I can tell via benchmarks).

Even with an optane pmem mount the perf. is super close.

Seems like that should be a 1.6.x feature if they're following semver.

The important thing here is that this fixes the cold cache problem for good.

I love memcached a lot. It took too long to get here, and this wasn't the primary use case of memcached.
