
We Replaced an SSD with Storage Class Memory
https://engineering.mongodb.com/post/we-replaced-an-ssd-with-storage-class-memory-here-is-what-we-learned
======
georgewfraser
Andy Pavlo talks about this in his class at CMU. You shouldn’t expect to get
better performance by running a disk-optimized storage engine on memory,
because you’re still paying all the overhead of locks and pages to work around
the latency of disk, even though that latency no longer exists. Instead, you
have to build a new, simpler storage engine that skips all the bookkeeping of
a disk-oriented storage engine.

[https://youtu.be/a70jRWLjQFk](https://youtu.be/a70jRWLjQFk)
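
A minimal sketch of that point (toy classes of my own, not WiredTiger or any real engine's internals): the disk-oriented path pays for page-table indirection and a latch on every read, bookkeeping that a memory-oriented layout simply does not have, even when the "disk" behind it is RAM.

```python
import threading

class DiskOrientedTable:
    """Records live in fixed-size pages behind a latched page table."""
    def __init__(self, records, page_size=4):
        self.page_size = page_size
        self.pages = [records[i:i + page_size]
                      for i in range(0, len(records), page_size)]
        self.latch = threading.Lock()  # overhead a pure in-memory engine skips

    def read(self, i):
        with self.latch:                            # latch taken per access
            page = self.pages[i // self.page_size]  # page-table indirection
            return page[i % self.page_size]

class MemoryOrientedTable:
    """Records addressed directly: no pages, no latches."""
    def __init__(self, records):
        self.records = list(records)

    def read(self, i):
        return self.records[i]

disk = DiskOrientedTable(list(range(100)))
mem = MemoryOrientedTable(list(range(100)))
assert disk.read(42) == mem.read(42) == 42  # same answer, different overhead
```

Both paths return the same data; the disk-oriented one just does extra work per access, which is exactly the cost that no longer buys you anything once the latency it was hiding is gone.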

~~~
pocket_cheese
For anyone interested in how a database works underneath the hood, I could not
recommend a better MOOC than Andy Pavlo's database courses. His intro to
database course is so freaking good.

------
renewiltord
Jesus Christ, this is insane. Almost a Terabyte of 12.6 Gbps reads? I have a
bunch of geospatial entity resolution workloads that I could absolutely smash
with this. For way cheaper than the fat mem instances.

~~~
lrem
Back in the day when I interviewed for Google I had this beautiful question.
The interviewer fished for a basic distributed key-value store. I just kept
coming up with single-machine solutions to his numbers. "No, I really can have
that storage+bandwidth, here's the part number."

I'm still wondering if that interview cost me a level.

~~~
asdfasgasdgasdg
Even if you can do a particular problem on a single machine, sometimes that
isn't the right call. In a work-scheduled cluster environment, a task that
wants the entire machine may have trouble getting a slot unless it has the
priority to preempt everything else going on on that machine. We call such VMs
"picky", and they don't get scheduling guarantees.

~~~
michaelt
I'm having trouble thinking of how you'd end up with a "cluster" that couldn't
provide the power of a single machine?

~~~
throwaway_pdp09
[http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...](http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

"Rather than making your computation go faster, the systems introduce
substantial overheads which can require large compute clusters just to bring
under control.

In many cases, you’d be better off running the same computation on your
laptop."

My limited experience fits this in that a bit of smarts on a single box beats
a bunch of boxes.

(the link is a very good read BTW)
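
As a toy illustration of the COST argument (my own sketch, not McSherry's code): a simple single-threaded algorithm, here connected components via union-find, has zero coordination overhead, which is the kind of laptop baseline the post measures big graph-processing clusters against.

```python
def connected_components(n, edges):
    """Count connected components of an n-node graph, single-threaded."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # union the two components

    return len({find(v) for v in range(n)})

# Toy graph: {0,1,2}, {3,4}, and isolated node 5.
print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))  # -> 3
```

No partitioning, no shuffles, no stragglers; for graphs that fit in one machine's memory, there is nothing for a cluster to amortize.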

~~~
yencabulator
The biggest problem with that write-up is ignoring availability. "Fast all the
way up to the crash" can be much worse than slow and steady.

Of course, for a batch job with a runtime under 11 minutes, that probably
doesn't really matter too much. Just don't generalize that too much.

------
cm2187
Stupid question. What are the use cases for such massively fast write speeds?

If you are storing data to disk at that speed, you fill even the biggest
Optane drives in a couple of minutes. So it would have to be an application
where you need to overwrite a huge amount of data over and over again.
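
Back-of-envelope check of the "couple of minutes" claim. The 1.5 TB capacity is my assumption (the largest Optane SSDs were around that size, not a figure from the article), and the thread's 12.6 GB/s read figure is applied here as a sustained write rate purely for the sake of the estimate.

```python
capacity_gb = 1500          # assumed drive size, in GB
bandwidth_gb_per_s = 12.6   # figure quoted upthread, treated as write speed

seconds_to_fill = capacity_gb / bandwidth_gb_per_s
print(f"{seconds_to_fill:.0f} s, i.e. about {seconds_to_fill / 60:.1f} minutes")
```

That works out to roughly two minutes, so the comment's arithmetic holds: sustained writes at that rate only make sense for workloads that keep overwriting.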

~~~
stonewhite
That also necessitates matching network bandwidth.

It reminds me of the Bugatti Veyron situation, its tires lasting 15 minutes at
top speed yet it runs out of fuel in 12.

~~~
falcolas
I'd rather run out of fuel and drift to a stop, instead of having a blowout
and crash while driving at top speed. I'm not sure where the problem lies.

~~~
ReactiveJelly
It's not a problem with the car, it's an analogy that the fuel and the network
are bottlenecks where normally the tires and storage might be.

~~~
falcolas
Still doesn’t make sense. The car is correctly configured (for safety
purposes), and it would also make sense to be able to write to storage faster
than your network speed: no database writes exactly what it receives from the
network; it adds metadata and structure.

------
lichtenberger
You can efficiently read data at 256-byte granularity (4 cache lines) with
Optane Memory (due to checksums). I think it makes much more sense to
read/write fine-grained changes, for instance by aligning pages to 64 or 256
bytes instead of 4 KB. With 4 KB pages you often write too much data in the
first place, and you also pollute the caches with probably unnecessary data.
There's a paper about how to add cache-line-aligned mini-pages (16 cache
lines):
[https://db.in.tum.de/~leis/papers/nvm.pdf](https://db.in.tum.de/~leis/papers/nvm.pdf)
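
A sketch of the write-amplification point (granularities taken from the comment; the helper and the offsets are invented for illustration): flushing at 256-byte mini-page granularity writes back far less than flushing whole 4 KB pages when only one small record changed.

```python
PAGE = 4096  # classic page size
MINI = 256   # 4 cache lines, the granularity Optane reads efficiently

def dirty_bytes(changed_offsets, granularity):
    """Bytes written back if we flush at `granularity`-sized units."""
    touched_units = {off // granularity for off in changed_offsets}
    return len(touched_units) * granularity

# A single 8-byte record modified at byte offset 1000 inside one page:
print(dirty_bytes([1000], PAGE))  # -> 4096 written at page granularity
print(dirty_bytes([1000], MINI))  # -> 256 written at mini-page granularity
```

Same logical change, 16x less data written back, which is the comment's case for finer-grained alignment on SCM.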

~~~
ayende
Writing on a page (4 KB / 8 KB) boundary is almost always a good idea,
because there is a per-page overhead that you have to account for. It can be
in the range of 16-64 bytes, so a 256-byte mini-page is probably a bad idea.

Most of the _data_ is also not going to fit in 256 bytes anyway.
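
The space argument can be checked with quick arithmetic using the comment's own numbers (the helper name is mine): a fixed per-page header eats a much larger fraction of a 256-byte mini-page than of a 4 KB page.

```python
def overhead_fraction(header_bytes, page_bytes):
    """Share of a page consumed by its fixed header."""
    return header_bytes / page_bytes

print(f"{overhead_fraction(16, 4096):.1%}")  # 0.4% of a 4 KB page
print(f"{overhead_fraction(16, 256):.1%}")   # 6.2% of a 256 B mini-page
print(f"{overhead_fraction(64, 256):.1%}")   # 25.0% at the high end
```

At the high end of the quoted range, a quarter of every mini-page would be bookkeeping rather than data.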

~~~
lichtenberger
It might add some overhead, but I guess it depends on the page implementation,
but 16 bytes seems to be the minimum (and Optane Memory might , I agree. That
said if someone changes only one record the best thing is to write in the
smallest granularity possible on the storage medium.

So, it might well be that someone is only interested in only one or a few
records. Why then fetch and cache a whole 4Kb page if latency is good in both
cases (4kb and 256 bytes)? On the other hand I agree that you should probably
cache more data from a hot page.

------
Pelic4n
So you need to do THAT to get decent perfs with MongoDB. Good to know!

~~~
jiggawatts
Unless I'm missing something, their benchmark graphs at the bottom of the
report show that there is no significant benefit to using SCM with MongoDB!
Their internal overheads must be high enough to swamp the few microseconds
gained from the faster storage.

~~~
Pelic4n
That was a joke. I've been using MongoDB in production for years and I'm
salty. :)

