
LMDB: Intel Optane SSD Microbenchmark - hyc_symas
http://www.lmdb.tech/bench/optanessd/
======
haneefmubarak
I'm really curious as to why RocksDB was so slow at loading data from the
Optane. An uneducated guess would be that there's some weird interaction with
(f)sync() but I'd love to hear anyone else's ideas as to why that might be.

~~~
hyc_symas
I'd really love to know too, but so far we haven't found any explanations. The
loader is actually using all asynchronous writes, so there are no fsync()
calls during the loading phase at all. The CPU usage is about the same as for
the load on flash, so the delay can only be attributable to the actual I/O
operations. Perhaps this data set just magically happens to cause a write
access pattern that's pathologically bad for Optane SSDs. We re-tested
multiple times though and always had the same result - runs fine on flash,
slows down on Optane.

Even LMDB on the raw block device, which immediately did a physical write upon
every logical write, wasn't that slow. It's just bizarre.

~~~
mehrdadn
I have zero idea how any of these work underneath, but as a wild guess, is
there any chance that alignment (or something similar) is the issue? E.g.
maybe RocksDB tries to align its writes to some boundary in a way that's
awful for Optane's technology?

~~~
wtallis
Alignment sounds like a plausible factor, but in my testing I've found that
the penalty for unaligned writes on the Optane P4800X is less than a factor of
two.

~~~
hyc_symas
Hadn't really thought of that. I suppose we could have retested by making a
single partition taking up the entire device, and letting fdisk take care of
aligning the start of the partition. But I assumed that alignment wouldn't be
an issue when using the unpartitioned block device.

------
Heag3aec
> With LMDB on the raw block device, each write of a record results in an
> immediate write to the device

Shouldn't they use mmap instead of syscalls for byte-addressable storage to
avoid switches into kernel space?

~~~
wtallis
NVMe SSDs aren't byte addressable even when their underlying storage medium
is. Optane SSDs use 512B or 4kB sectors just like any other block storage
device. Actual byte-addressable Optane DIMMs are just starting to become
available, though only to major cloud computing providers so far.

Using mmap with infrequent msync calls would mean you're running with looser
data consistency guarantees. That may be suitable for some use cases, but it
doesn't necessarily make for fair benchmarking.

~~~
hyc_symas
The loader doesn't perform any sync calls. It really doesn't need to since the
DB is larger than RAM - eventually the FS cache fills and then every newly
written page will force an existing dirty page to get flushed.

To answer the parent post, when using the raw block device we're actually
using mmap already. That's what LMDB means: Lightning Memory-Mapped Database.
And while not all raw devices support read/write calls, if they support mmap
we use it. But writing through mmap actually performs poorly for larger-than-RAM
DBs. Whenever you access a new page, the OS takes a page fault to page it in
from storage first. It's a wasted I/O in this case because we're about to
overwrite the entire page.

------
grogers
What are you doing differently in this test that your writes/sec number for
RocksDB is only 5k, whereas your previous benchmarks several years ago show
100k?

~~~
hyc_symas
Excellent question. Several years ago RocksDB didn't support ACID
transactions. This test turns on transaction support in RocksDB.

------
dis-sys
Can't wait to see Optane-as-RAM tests using LMDB.

~~~
hyc_symas
I hope to have the next writeup done in a couple days...

~~~
tanelpoder
I wasn't aware that the Xeon Gold CPUs (61xx, instead of the Xeon E5 or E7
x6xx models) also support Intel Memory Drive Technology. If so, then great!
Or are you going to use a different server for the IMDT test?

~~~
hyc_symas
The same server is used for the IMDT tests. From what I saw, IMDT is basically
a thin hypervisor that maps the SSD into the address space and then handles
the page faults before the main OS ever sees them. I don't think this requires
anything that any arbitrary CPU/MMU couldn't handle.

------
ghc
I can't wait until we see Optane DIMMs come to market. It's no surprise that
Optane doesn't perform great as an SSD when every part of the stack is so
highly optimized to minimize the impact of SSD limitations.

~~~
hyc_symas
We'll have test results on Optane DIMMs posted "soon" - as soon as the NDA
expires...

~~~
ryanworl
Did you use the cache-line flush instructions directly or use some higher-
level package from the PMDK for adding byte-addressable NVM support to LMDB?

~~~
hyc_symas
Nope, nor was any of that necessary. LMDB works reliably with Optane DIMMs
already, unmodified, by default. By default, LMDB uses a read-only mmap and
write() syscalls, so it's the OS's job to persist the data. If you used the
MDB_WRITEMAP option to write through the mmap, then you would indeed need to
add explicit cache-line flush instructions. But that's not the recommended way
to use LMDB.

