
Caching Beyond RAM: The Case for NVMe - Rafuino
https://memcached.org/blog/nvm-caching/
======
Rafuino
Disclosure: I work at Intel and submitted the post.

Dormando mentions the test was done with the help of Accelerate With Optane, a
collaboration we have with Packet to provide access to servers with Intel
Optane SSDs. Check out
[https://www.acceleratewithoptane.com/](https://www.acceleratewithoptane.com/)
for more info, and you can find me and the Packet team over at
slack.packet.net. We're especially interested in open source projects that
want to test what they can do with the tech and are interested in sharing what
they learned with the broader community. Thanks to Dormando for going first!

~~~
ewams
Skylake cpu? Which model?

Assuming you're using 32GB DIMMs to populate all six memory channels (if
Skylake)? This is something that has bitten me a few times - most Skylake
CPUs have 6 memory channels, so balanced is 192, 384, etc. Not the 4 channels
we are used to! (Quick arithmetic below.)

Would also recommend trying different RAM configs anyway; on our side we have
seen better throughput on 768 than on 384. Even 512 performs better in some
cases than 192.
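
Quick arithmetic behind those balanced sizes (a sketch; the DIMM sizes and the
dual-socket assumption are mine, the actual server config is on the
acceleratewithoptane.com page):

```go
package main

import "fmt"

func main() {
	// Assumed: a 6-memory-channel Skylake-SP socket, one identical DIMM
	// per channel, two sockets per box.
	channels := 6
	for _, dimmGB := range []int{16, 32, 64} {
		perSocket := channels * dimmGB
		fmt.Printf("%2d GB DIMMs: %3d GB/socket, %4d GB dual-socket\n",
			dimmGB, perSocket, perSocket*2)
	}
	// 32 GB DIMMs -> 192 GB per socket and 384 GB dual-socket, which is
	// where the 192/384 figures above come from; 64 GB DIMMs give 768.
}
```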

~~~
Rafuino
[https://www.acceleratewithoptane.com/access/](https://www.acceleratewithoptane.com/access/)
has the server info. We aimed for a balanced configuration to enable the
widest variety of use cases. Not everyone will need to use everything
provided, and it might not fit everyone's needs!

------
bradleyjg
What's the lifetime of an Optane or NVMe drive when used as a constantly
thrashing cache? Weeks? Months?

Edit: Missed this the first time through:

> _Calculating this is done by monitoring an SSD's tolerance of "Drive Writes
> Per Day". If a 1TB device could survive 5 years with 2TB of writes per 24
> hours, it has a tolerance of 2 DWPD. Optane has a high tolerance at 30 DWPD,
> while a high end flash drive is 3-6 DWPD._
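
For a sense of scale, here's back-of-the-envelope arithmetic using the figures
quoted above (the 10 TB/day cache write rate is a made-up illustration, not a
number from the post):

```go
package main

import "fmt"

func main() {
	// Numbers from the quoted passage: a 1 TB drive rated at 2 DWPD over
	// a 5-year warranty.
	capacityTB := 1.0
	dwpd := 2.0
	warrantyYears := 5.0

	// Total rated endurance (terabytes written) over the warranty period.
	tbw := capacityTB * dwpd * 365 * warrantyYears
	fmt.Printf("rated endurance: %.0f TB written\n", tbw) // ~3650 TB

	// Hypothetical thrashing cache writing 10 TB/day (made-up figure):
	// the same drive reaches its rating in roughly a year.
	writesPerDayTB := 10.0
	fmt.Printf("days to rated endurance at %.0f TB/day: %.0f\n",
		writesPerDayTB, tbw/writesPerDayTB)
}
```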

~~~
ec109685
They also talk in the paper of keeping the thrashing parts of the cache in
RAM. Facebook for example calculates the effective hit rate of a larger set of
items and only caches those that won’t be quickly purged out / overwritten in
secondary storage.

~~~
ebikelaw
It's only four more years before the patent on adaptive replacement caching
expires. Then we can use it in memcached ...

~~~
zaarn
There is CAR which has almost the same performance and no patent. You can use
that.

~~~
eternalban
Why not just use LRU?

~~~
Symmetry
Performance. The Wikipedia page on cache policies is pretty good.

[https://en.wikipedia.org/wiki/Cache_replacement_policies#Clo...](https://en.wikipedia.org/wiki/Cache_replacement_policies#Clock_with_adaptive_replacement_\(CAR\))

~~~
eternalban
Wiki says "Substantially" better than LRU but actual results seem to show
performance converging to same levels the larger the cache gets. (See page 5.)

[https://dbs.uni-leipzig.de/file/ARC.pdf](https://dbs.uni-
leipzig.de/file/ARC.pdf)

[p.s. there is also the matter of the (patterns in the) various trace runs.
Does anyone know where these traces can be obtained?]

~~~
ebikelaw
All caches have equal hit rates in the limit when the size of the cache
approaches infinity. For finite caches, ARC often wins. In practical
experience I've found that a weighted ARC dramatically outperformed LRU for
DNS RR caching, in terms of both hit rate and raw CPU time spent per access.
This is because it was easy to code an ARC cache that had lock-free access to
frequently referenced items; once an item had been promoted to T2 no locks
were needed for most accesses. With LRU it's necessary to have exclusive
access to the entire cache in order to evict something and add something else.

Of course there are more schemes than just LRU and ARC, and one can try to
employ lock-free schemes more than I'm willing to do. This is just my
experience.
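
To make the LRU point above concrete, here's a minimal sketch (not the DNS
cache described here, and not memcached's LRU) of why a classic LRU
serializes on a single lock: every hit, not just every eviction, mutates the
recency list.

```go
package lru

import (
	"container/list"
	"sync"
)

// entry is what the recency list stores.
type entry struct {
	key string
	val []byte
}

// LRU is a classic doubly-linked-list LRU behind a single mutex.
type LRU struct {
	mu    sync.Mutex
	cap   int
	ll    *list.List               // front = most recently used
	items map[string]*list.Element // key -> element in ll
}

func New(capacity int) *LRU {
	return &LRU{cap: capacity, ll: list.New(), items: map[string]*list.Element{}}
}

// Get must take the lock even on a hit, because it reorders the list.
func (c *LRU) Get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		c.ll.MoveToFront(el)
		return el.Value.(*entry).val, true
	}
	return nil, false
}

// Set evicts the tail and inserts under the same exclusive lock.
func (c *LRU) Set(key string, val []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.ll.MoveToFront(el)
		return
	}
	if c.ll.Len() >= c.cap {
		if tail := c.ll.Back(); tail != nil {
			c.ll.Remove(tail)
			delete(c.items, tail.Value.(*entry).key)
		}
	}
	c.items[key] = c.ll.PushFront(&entry{key: key, val: val})
}
```

An ARC-style T2 segment can skip that list update for items that are already
promoted, which is where the mostly lock-free hit path comes from.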

~~~
NovaX
ARC often wins against LRU, but there is a lot left on the table compared to
other policies. That's because it does capture some frequency, but not very
well imho.

You can mitigate the exclusive lock using a write-ahead log approach [1] [2].
Then you record events into ring buffers, replay them in batches, and have an
exclusive tryLock. This works really well in practice and lets you do much
more complex policy work with much less worry about concurrency. (A rough
sketch follows the links below.)

[1] [http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html](http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html)

[2] [http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-1.pdf](http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-1.pdf)
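
To illustrate the approach rather than either linked implementation, here's a
rough sketch (names and structure are mine, not Caffeine's or memcached's;
needs Go 1.18+ for TryLock): reads drop an event into a bounded buffer and
return immediately, and whichever caller wins the try-lock drains the buffer
and applies the policy updates in a batch.

```go
package policy

import "sync"

// Cache separates the read path (lock-free sync.Map) from policy
// maintenance, which is replayed in batches from a bounded buffer.
type Cache struct {
	data     sync.Map       // key -> value, read without the policy lock
	accesses chan string    // stand-in for a ring buffer of access events
	policyMu sync.Mutex     // guards policy state; only ever try-locked
	policy   map[string]int // stand-in for LRU/ARC/TinyLFU bookkeeping
}

func New() *Cache {
	return &Cache{
		accesses: make(chan string, 1024),
		policy:   map[string]int{},
	}
}

func (c *Cache) Set(key string, value any) {
	c.data.Store(key, value)
	// A real cache would also enqueue a write event and evict when over
	// capacity; omitted in this sketch.
}

func (c *Cache) Get(key string) (any, bool) {
	v, ok := c.data.Load(key)
	if ok {
		// Record the access without blocking; if the buffer is full the
		// event is simply dropped, since the policy is only a heuristic.
		select {
		case c.accesses <- key:
		default:
		}
		c.maybeDrain()
	}
	return v, ok
}

// maybeDrain replays buffered events only if the policy lock is free, so
// at most one caller at a time pays for policy maintenance and the rest
// never block on it.
func (c *Cache) maybeDrain() {
	if !c.policyMu.TryLock() {
		return
	}
	defer c.policyMu.Unlock()
	for {
		select {
		case k := <-c.accesses:
			c.policy[k]++ // apply the (batched) policy update
		default:
			return
		}
	}
}
```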

------
praseodym
Facebook has recently published a paper on how they use NVM caches with their
MyRocks (MySQL) databases; the morning paper has a really good write-up:
[https://blog.acolyer.org/2018/06/06/reducing-dram-footprint-with-nvm-in-facebook/](https://blog.acolyer.org/2018/06/06/reducing-dram-footprint-with-nvm-in-facebook/)

------
Andys
The limiting factor of Optane is actually the controller: the power
consumption of the controller itself, as well as the cost of getting high
bandwidth to and from the CPU and memory. Current products strike a
conservative balance, and future ones could get much faster.

------
reacharavindh
An adapter that splits a single PCIe slot (x16) to hold 4 x M.2 NVMe SSDs (x4
lanes each) would be a great way to persist a Redis instance that is not just
serving as a cache.

If the same can be done with Optane SSDs, the lower latency at higher queue
depths will certainly help.

~~~
robhu
This already exists: [https://www.tweaktown.com/reviews/8542/asrock-ultra-quad-2-card-16-lane-aic-review/index.html](https://www.tweaktown.com/reviews/8542/asrock-ultra-quad-2-card-16-lane-aic-review/index.html)

~~~
namibj
This is not a low-profile card, and it wastes quite a lot of space. It should
hold two cards on each side of the board, with the connectors facing
orthogonally to those of the x16 slot.

~~~
wtallis
You can't put something as tall as a M.2 connector on the back side of an
expansion slot without violating the form factor guidelines and enchroaching
on the space of the next slot over. The only compliant way to put M.2 drives
on the back is to use an offset edge connector so the main board is a bit
lower than it usually would be. Amfeltec has some boards like this, but I
think they have a patent on their offset connector.
[http://amfeltec.com/products/pci-express-gen-3-carrier-board-for-4-m-2-ssd-modules/](http://amfeltec.com/products/pci-express-gen-3-carrier-board-for-4-m-2-ssd-modules/)

~~~
namibj
Oh. Well, there are cases where it would fit without such an exotic connector,
but those are non-compliant.

I assume you don't need licenses for just making a dumb PCIe card, if you
don't name it with trademarks? Or are there patents you need to license to
sell PCIe-compatible, non-electronic cards?

------
mmt
What would be interesting to see is if this benefit would be better applied
directly to the database instead.

The relative scarcity of NVMe ports/bandwidth per server may make that as
unattractive as doing the (RAM) caching on the database server itself, but
it's not obvious, if one could only spend the money in one place, where it
would be best spent.

~~~
rzzzt
NVMe drives use 2-4 PCIe lanes, so you can have quite a few of them in a
system with a suitable adapter.

~~~
mmt
I'd say that's comparable scarcity to DIMM slots, as the ratio between those
and PCIe channels tends to be in the range of 1:2-4 on current CPUs.

------
majidazimi
Is there any reason why memcached/redis doesn't provide the external store as
an interface and let people implement the load/store operations? I would like
to be able to implement a "hot(RAM) -> warm(NVM) -> cold(SSD) -> DB" style
cache hierarchy.
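
Something like the interface below is presumably what's being asked for. It's
a hypothetical sketch, not memcached's or Redis's actual API (extstore is
wired differently, as the reply below explains): each tier implements the
same load/store contract and a wrapper falls through the tiers on a miss.

```go
package store

import "errors"

// ErrMiss signals that a tier does not hold the key.
var ErrMiss = errors.New("key not found in this tier")

// Tier is the hypothetical pluggable external-store interface.
type Tier interface {
	Load(key string) ([]byte, error)
	Store(key string, value []byte) error
}

// Hierarchy chains tiers from hottest to coldest,
// e.g. RAM -> NVM -> SSD -> DB.
type Hierarchy struct {
	tiers []Tier
}

// Load falls through the tiers and promotes a hit back into hotter ones.
func (h *Hierarchy) Load(key string) ([]byte, error) {
	for i, t := range h.tiers {
		v, err := t.Load(key)
		if err != nil {
			continue
		}
		for j := 0; j < i; j++ {
			_ = h.tiers[j].Store(key, v) // best-effort promotion
		}
		return v, nil
	}
	return nil, ErrMiss
}

// Store writes to the hottest tier; demotion to colder tiers would happen
// on eviction, which this sketch leaves out.
func (h *Hierarchy) Store(key string, value []byte) error {
	return h.tiers[0].Store(key, value)
}
```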

~~~
dormando
No reason; it's doable with a bit more cleanup. The I/O bits are encapsulated
entirely in extstore.c with a relatively clean interface. It's not a layered
setup, though: it's all bolted into the same hash table, which means pluggable
I/O systems don't make much sense.

It ensures that a lot of operations can't touch secondary storage (like
misses, touches, deletes, sets of new items, etc.), which reduces load on the
I/O by quite a lot.

edit: Also extstore itself will support NVM + SSD sort-of-layers soon enough.
I'll be retesting that on the same optane+ssd machine in a couple weeks.

------
hyc_symas
Could have just used the LMDB-backed memcachedb and gotten the RAM/NVMe cache
management transparently, for free.

[https://github.com/LMDB/memcachedb](https://github.com/LMDB/memcachedb)

------
ihsw2
It is very exciting to see Optane competitive with conventional NVMe drives --
this is something that Intel got right early and their exploration into DIMM-
socketed Optane drives is similarly exciting.

It has already been established that for most consumer workloads, the latency
differences between memory and Optane are negligible. This article shows that
heavy-duty workloads (ie: high-traffic memcached clusters) can be accommodated
by Optane and NVMe too. Clusters of 500K IOPS drives can take us most of the
way there.

I don't want to get all /r/hailcorporate but Optane drives are great products
and (more importantly) you can run them on AMD platforms (ie: SP3) too.
Granted, NVMe drives bring the fight and they're much more cost competitive at
scale, but that will change soon.

~~~
selectodude
I love toys and I would love to jump feet first into Optane, but a dollar per
gigabyte is hard to stomach.

~~~
JoshTriplett
About a decade ago, I paid around $600 for the 80GB X25-M, the first SSD that
focused on performance rather than _just_ low latency. One of the best system
upgrades I've ever had.

A dollar per gigabyte doesn't seem half bad at all for the top end of
performance.

~~~
berbec
My first hard drive upgrade was an 80MB HardCard - an ISA hard disk for my PC
XT 286. I recall it cost close to $1,000.

There's always going to be top-tier storage that costs an arm and a leg -
that just means two steps down gets affordable. Optane pricing will drive
down NVMe, which will drive down SATA SSD.

~~~
dragontamer
The main issue is that video gamers don't see any benefits to NVMe NAND
drives. So the "hardcore gamer" market is beginning to stall out on SATA SSD
drives.

Why pay 2x more for NVMe NAND if your video game load times aren't any better?

~~~
zlynx
They are better at load times. Just not 2x better.

~~~
berbec
The price/performance of a couple of 860 Evos in RAID 0 is hard to beat.

