
How to Build a Non-Volatile Memory DBMS - blopeur
https://www.cs.cmu.edu/~jarulraj/pages/sigmod_2017_tutorial.html
======
hyc_symas
Eh. None of this is new, and we already anticipated this with LMDB back in
2009.

The fact that NVRAM is directly addressable (and thus can bypass the page
cache) will eventually play out as irrelevant. It will always be a fact that
slower mass-storage will exist, more cheaply than fast in-core storage,
persistent or not. The page cache will still be needed even for NVRAM, and the
"page" will still be the necessary atomic unit of memory interchange. (Direct
access is of course a great thing, but it will be direct access to _virtual_
addresses. Virtual memory, and paging in/out between primary and secondary
storage, is never going away. Every commodity system that has tried to do
without PMMUs has failed, for numerous good reasons.)

The continual allusions to "frequent writes can destroy memory cells" seem to
mainly relate to the extremely short lifetime of Intel's 3DXpoint memory,
which is by every measure a total failure.

[http://semiaccurate.com/2017/03/10/intel-mislead-press-xpoin...](http://semiaccurate.com/2017/03/10/intel-mislead-press-xpoint-next-week/)

It would be best to ignore 3DXpoint and just focus on STT-MRAM, which is
already at parity with DRAM for endurance. (But still lacking in density.)

Much of the other stuff in those slides is still off the mark. E.g., LMDB
today has perfect crash reliability with zero recovery time. The "write behind
logging" they propose still has logging overhead and non-zero crash recovery
time - which equals wasted work. Anything that requires logging or any form of
compaction or garbage collection is wasted work. It's completely unnecessary,
and that has nothing to do with NVM. Treating NVM DB design as if it's an
entirely new and different animal is frankly ignorant. The right design works
in all scenarios - as LMDB does.

~~~
cryptonector
I'm not sure that using mmap(2) is enough. You're still writing pages, so
you're writing more than you have to, so writes will not go as fast on NVM as
they could.

LMDB does logging since it does copy-on-write -- it just reuses free pages as
soon as possible, which means that over time the log disappears. (LMDB is not
an append-only DB.)

I agree as to GCs. Having to GC is not just wasted work, it's a performance
disaster, and any study that hand-waves about GC without actually measuring
its impact on performance is fatally flawed. Assume petabytes of data, and
assume performance between SSDs and DRAM: GC will still take enormous amounts
of time and I/O that could have been used for something else. Write-behind
logging is terrible if it means you need a GC. Why even bother with write-
behind logging if a failure means that you must GC anyway?

Only write-ahead logging helps you avoid a GC. A write-ahead log could be
optimized to log only (address, length) tuples for each transaction so as to
minimize WAL writes.

However, the insights about smaller-than-page I/Os, particularly as to writes,
seem likely to be correct. Though having anything like a b-tree on NVM with
smaller-than-page writes seems to me to imply in-place writes, but I'm not
ready to give up on COW.

~~~
mtanski
> I'm not sure that using mmap(2) is enough. You're still writing pages, so
> you're writing more than you have to, so writes will not go as fast on NVM
> as they could.

If you're using the newfangled memory there are a few components in play.

First in Linux there's DAX. DAX lets you mmap in the devices pages directly
without going through the page cache. And the processor can handle this like a
memory mmaped device (without interactions from the OS) including subpage
read/writes.

Second, Intel chose to reuse previously existing operations that you would
normally use for flushing cache lines (to main memory). Previously they were
going to use an additional pcommit operation.
[https://software.intel.com/en-us/blogs/2016/09/12/deprecate-...](https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction)

~~~
cryptonector
I got that. My point is that merely using this is insufficient to make your DB
go fast on NVM. As the presentation says, you need to consider changing the
on-disk^W^W^Wpersistent storage format to be one that writes less (not so much
because of write cycle limits, but because writes are slower, and to reduce
the number of additional writes needed in, say, a COW format).

------
nuopnu
The presentation and the paper pretty much describe how LMDB works already.

[https://symas.com/lightning-memory-mapped-database/](https://symas.com/lightning-memory-mapped-database/)

Except it allows only a single writer, and it's not directly byte-addressable
unless you preallocate a chunk of fixed-size mmap'ed memory.

~~~
valarauca1
Yes, but in byte-addressable mode you can trust the map is fixed, which
doesn't seem to be the case here.

~~~
nuopnu
Not exactly. I'm not a user of this, but

[http://www.lmdb.tech/doc/group__mdb__env.html#ga492952277c48...](http://www.lmdb.tech/doc/group__mdb__env.html#ga492952277c481bc4a6fa08ef71c29487)

Ctrl+f through the doc for "MDB_FIXEDMAP" for more details.

So there is _some_ support, where you can expect to have the whole file at a
fixed address. What is not handled is that your data may be moved within the
file, and there's nothing you can do about it, short of keeping a long-lived
read only transaction:

[http://www.lmdb.tech/doc/todo.html](http://www.lmdb.tech/doc/todo.html)

~~~
cryptonector
Note that the LMDB approach, and especially if you want MDB_FIXEDMAP, limits
DB size to the largest mmap()ing you can get at a fixed location. That's not
good in a world of 48-bit address spaces.

~~~
nuopnu
Practically it's even less than 47 bits. :) But sure, know your limits.

~~~
cryptonector
Yes, I know :)

------
0xFFC
I am currently thinking about starting graduate school, and this dude "Andy
Pavlo" is my hero. The amount of information he pumps (I don't know any other
word) to his listeners is outlandish, while staying fun and serious and
practical at the same time. I watched most of his lectures. I have done some
work in an industrial research lab with CUDA. I did OS a lot and it was my
first choice, but after watching his lectures I am seriously considering
changing my field to DB. Especially since I can use my Operations Research
knowledge much more effectively in DB than in OS.

~~~
hyc_symas
You cannot write a good database without a solid understanding of operating
systems and computer architecture.

------
crispyambulance
"Non Volatile Memory" really means Intel's "3D X-point" memory-- estimated
availability of this product for system memory is 2018-ish.

[http://www.tomshardware.com/news/intel-optane-cascade-lake-d...](http://www.tomshardware.com/news/intel-optane-cascade-lake-dimm,34471.html)

~~~
Quequau
Optane is on sale now.

~~~
hoschicz
Optane's performance is hardly on par with SSDs, let alone DRAM.

XPoint is expected to have performance much higher than Optane.

~~~
valarauca1
Optane is the "marketing name"; the technology behind Optane is 3D XPoint.

~~~
Babooster
The Optane available currently isn't in DIMM-format which is expected to give
a further boost though.

~~~
valarauca1
This is true. Bypassing the page cache and VFS will be a net win. But not all
my storage needs can be met with mmaps.

------
michaelmior
Video is available
[https://www.youtube.com/watch?v=ljrpXVlkQ84](https://www.youtube.com/watch?v=ljrpXVlkQ84)

~~~
DocSavage
Seems to be missing the first part of the talk and also has no sound.
Hopefully these will be fixed in the future.

------
gghh
Anyone know if a preprint of the paper is available? Both the "paper" and the
"slides" links point to the same document, i.e. the slides.

~~~
michaelmior
There typically isn't a paper associated with tutorials. I'm guessing they
just copied from a standard template that had a space for a paper.

------
skybrian
Since machines can crash, possibly corrupting data, it seems like you still
need replication? The network might be the new bottleneck.

~~~
jfoutz
Journal files. You can begin a transaction, crash, and lose that transaction,
but the database itself will stay consistent.

Perhaps there's something funny about the failure modes of non-volatile memory
that's different from a disk. Replication can improve things, sure. There's
always a chance of a cosmic ray doing something weird to your stuff. That's
(so far) a lot less common than a regular old crash.

~~~
skybrian
I was thinking along the lines of the machine being on the wrong side of a
network partition and then disappearing entirely. Cloud services are supposed
to survive the loss of any single machine.

~~~
jfoutz
Ah, I think you're right, but that's a different problem. The Jepsen talks are
enlightening, as so many errors get highlighted. Ultimately, if your algorithm
is right, the hardware won't matter at all. Each node could be a person with
pencil and paper, and the messages are lossy paper airplanes.

The NVM stuff (I think) just changes parameters to cache size, journal log
size, stuff like that. I mean, it's cool tech, and it'll probably make stuff
faster. But the reliability comes from the architecture and faithful
implementation.

------
ComodoHacker
>Assistant Professor of _Databaseology_

Oh, I didn't know that's a thing.

~~~
Babooster
Sounds like Indiana Jones. Data mining ancient databases.

