LMDB: Intel Optane SSD Microbenchmark (lmdb.tech)
107 points by hyc_symas 75 days ago | 27 comments

I'm really curious as to why RocksDB was so slow at loading data from the Optane. An uneducated guess would be that there's some weird interaction with (f)sync(), but I'd love to hear anyone else's ideas as to why that might be.

I'd really love to know too, but so far we haven't found any explanations. The loader is actually using all asynchronous writes, so there are no fsync() calls during the loading phase at all. The CPU usage is about the same as for the load on flash, so the delay can only be attributed to the actual I/O operations. Perhaps this data set just magically happens to cause a write access pattern that's pathologically bad for Optane SSDs. We re-tested multiple times though and always had the same result: runs fine on flash, slows down on Optane.

Even LMDB on the raw block device, which immediately did a physical write upon every logical write, wasn't that slow. It's just bizarre.

The LMDB test is a bit contrived. By choosing a key of 16 bytes and a value of 4000 bytes, it's nearly optimal. The underlying block device wants 4KiB aligned writes for best performance, and that's pretty much what this test is doing. RocksDB is best at performing writes into very small records, where it can coalesce them and reduce write amplification. LMDB can't do this. As the size of the records increases, the RocksDB performance advantage disappears, and then it eventually loses out. In my own tests, I've found that the crossover point is about 250 bytes. A better test comparing LMDB to RocksDB would show what this crossover is, in a more controlled fashion than what I have done.
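
The record-size argument above can be put into rough numbers. The following is only a toy model, not how LMDB or RocksDB actually account for I/O: it assumes every record update rewrites whole 4096-byte pages and ignores page headers, B-tree branch pages, and any batching. The function name and the list of sizes are made up for illustration.

```python
import math

PAGE_SIZE = 4096  # bytes; assumed page/sector granularity


def page_write_amplification(key_size: int, value_size: int,
                             page_size: int = PAGE_SIZE) -> float:
    """Bytes physically written per byte of payload when each record
    update rewrites whole pages (toy model, see assumptions above)."""
    payload = key_size + value_size
    pages = math.ceil(payload / page_size)  # whole pages needed for one record
    return pages * page_size / payload


# Compare small records with the 16-byte key / 4000-byte value used in the test.
for value_size in (100, 250, 1000, 4000):
    ratio = page_write_amplification(16, value_size)
    print(f"value={value_size:5d} bytes  amplification={ratio:.2f}x")
```

Under these assumptions the 16 + 4000 byte record wastes almost nothing per page, while small records pay for a full page each, which is the regime where coalescing writes would help most.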

That's definitely a problem: since the pattern isn't understood yet, the same conditions could arise later with LMDB and cause a seemingly random performance issue.

I have no idea how any of these work underneath, but as a wild guess, is there any chance that alignment (or something similar) is the issue? E.g. maybe RocksDB tries to align its writes to some boundary in a way that's awful for Optane's technology?

Alignment sounds like a plausible factor, but in my testing I've found that the penalty for unaligned writes on the Optane P4800X is less than a factor of two.

Hadn't really thought of that. I suppose we could have retried by making a single partition taking up the entire space, and letting fdisk take care of aligning the start of the partition. But I assumed that alignment wouldn't be an issue when using the unpartitioned block device.

Write latency can be lower on flash-based SSDs than Optane because the former tend to have DRAM caches, but Optane SSDs have little to no caching. The difference isn't enough to produce that large of a discrepancy, so I suspect that the data being written to the Optane SSD was either not being batched into blocks of the same size as for the flash SSDs, or that fewer operations were being queued simultaneously for the Optane SSDs. I wonder if there were some mkfs or mount options that defaulted to a different value for the Optane SSD. I'm also curious whether the NVMe driver was used in polling mode.

Since the loader was performing asynchronous writes, actual batching would be whatever the OS is doing when it eventually flushes dirty pages to the underlying device. I can't imagine the OS using different queue depths during the RocksDB run than the LMDB run. mkfs/mount options were identical for both. We didn't touch the NVMe driver at all, so whatever its default behavior is, was used on all tests.

By the way - I think you can still get free access to run your own tests. https://twitter.com/IntelStorage/status/1010284314121129985

Let me know if you want to try to rerun these and double-check the results. Also, we found that 4.x kernels performed significantly better than 3.x kernels on these tests.

I already have the hardware needed to run my own tests (albeit only one P4800X), but I've been almost exclusively focusing on consumer SSDs recently. I'll have some time later this month to try to replicate your results. I don't have experience setting up RocksDB or LMDB. Is the process of getting your benchmark running pretty self-explanatory?

All the shell scripts I used are in the data.tgz file linked at the bottom of the page. You should be able to run after editing a few pathnames. Compiling the dbbench drivers should be simple enough; I may be able to send you binaries if not.

> With LMDB on the raw block device, each write of a record results in an immediate write to the device

Shouldn't they use mmap instead of syscalls for byte-addressable storage to avoid switches into kernel space?

NVMe SSDs aren't byte addressable even when their underlying storage medium is. Optane SSDs use 512B or 4kB sectors just like any other block storage device. Actual byte-addressable Optane DIMMs are just starting to become available, though only to major cloud computing providers so far.
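
To make the block-addressing point concrete, here is a small sketch of which sectors a byte range touches. The helper names are hypothetical and the 4096-byte default is just one of the two sector sizes mentioned above.

```python
def covering_sectors(offset: int, length: int, sector_size: int = 4096) -> range:
    """Sector numbers touched by the byte range [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // sector_size
    last = (offset + length - 1) // sector_size
    return range(first, last + 1)


def is_sector_aligned(offset: int, length: int, sector_size: int = 4096) -> bool:
    """True when the range starts and ends exactly on sector boundaries."""
    return offset % sector_size == 0 and length % sector_size == 0


# A 1-byte update still costs a whole sector.
print(list(covering_sectors(0, 1)))
# A 200-byte write that straddles a boundary touches two sectors.
print(list(covering_sectors(4000, 200)))
# A full, aligned 4 KiB write touches exactly one.
print(list(covering_sectors(4096, 4096)), is_sector_aligned(4096, 4096))
```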

Using mmap with infrequent msync calls would mean you're running with looser data consistency guarantees. That may be suitable for some use cases, but it doesn't necessarily make for fair benchmarking.

The loader doesn't perform any sync calls. It really doesn't need to since the DB is larger than RAM - eventually the FS cache fills and then every newly written page will force an existing dirty page to get flushed.
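
The "cache fills, then every new page forces a flush" behavior can be sketched with a toy write-back cache. This is a deliberately simplified model and an assumption on my part: real kernels also flush on timers and dirty-ratio thresholds, and the function name and eviction order here are invented for illustration.

```python
from collections import OrderedDict


def count_forced_flushes(page_writes, cache_pages: int) -> int:
    """Count flushes in a toy cache that holds at most `cache_pages`
    dirty pages and evicts the oldest one when a new page arrives."""
    dirty = OrderedDict()
    flushes = 0
    for page_no in page_writes:
        if page_no in dirty:
            dirty.move_to_end(page_no)  # re-dirtying a cached page needs no I/O
            continue
        if len(dirty) >= cache_pages:
            dirty.popitem(last=False)   # cache full: flush the oldest dirty page
            flushes += 1
        dirty[page_no] = True
    return flushes


# Data set smaller than the cache: nothing is forced out.
print(count_forced_flushes(range(3), cache_pages=4))
# Data set larger than the cache: every page beyond capacity forces a flush.
print(count_forced_flushes(range(10), cache_pages=4))
```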

To answer the parent post, when using the raw block device we're actually using mmap already. That's what LMDB means: Lightning Memory-Mapped Database. And while not all raw devices support read/write calls, if they support mmap we use it. But writing thru mmap actually performs poorly for larger-than-RAM DBs. Whenever you access a new page, the OS takes a page fault to page it in from storage first. It's a wasted I/O in this case because we're about to overwrite the entire page.
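
For readers unfamiliar with the two write paths being compared, here is a minimal sketch on an ordinary temporary file, assuming a POSIX system. It is not LMDB's code, the function names are made up, and a small file that fits in RAM will not reproduce the larger-than-RAM penalty; it only shows where the extra page fault can occur.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def write_page_via_mmap(path: str, page_no: int, data: bytes) -> None:
    """Overwrite one page through a shared writable mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:   # length 0 maps the whole file
            start = page_no * PAGE
            # The first touch of an uncached page may fault its old contents
            # in from storage, even though we overwrite all of it.
            m[start:start + PAGE] = data
            m.flush()                          # push the dirty page back (msync)


def write_page_via_pwrite(path: str, page_no: int, data: bytes) -> None:
    """Overwrite one page with a write syscall; no prior read is needed."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, data, page_no * PAGE)
        os.fsync(fd)
    finally:
        os.close(fd)


def demo() -> bytes:
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.truncate(4 * PAGE)                 # four zero-filled pages
        path = tmp.name
    try:
        write_page_via_mmap(path, 1, b"A" * PAGE)
        write_page_via_pwrite(path, 2, b"B" * PAGE)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.unlink(path)
```

Both paths leave the same bytes on disk; the difference described above is in the I/O the OS performs along the way, which this sketch cannot measure.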

What are you doing differently in this test that your writes/sec number for RocksDB is only 5k, whereas your previous benchmarks several years ago show 100k?

Excellent question. Several years ago RocksDB didn't support ACID transactions. This test turns on transaction support in RocksDB.

Can't wait to see Optane as RAM tests using lmdb.

I hope to have the next writeup done in a couple days...

I wasn't aware that the Xeon Gold CPUs (61xx instead of Xeon E5 or E7 x6xx models) also support the Intel Memory Drive Technology. If yes, then great! Or are you going to use a different server for the IMDT test?

The same server is used for the IMDT tests. From what I saw, IMDT is basically a thin hypervisor that maps the SSD into the address space and then handles the page faults before the main OS ever sees them. I don't think this requires anything that any arbitrary CPU/MMU couldn't handle.

I don't think IMDT has any particular hardware requirements beyond the usual modern virtualization feature set. It seems to be supported on all LGA2011[x] Xeons going back to Ivy Bridge, and the current Xeon Scalable product line (Platinum, Gold, Silver, Bronze) on LGA3467. I wouldn't be surprised if it could be made to work on the desktop platforms, too, but it probably wouldn't be worth the trouble.

Optane IMDT test results are now published: http://www.lmdb.tech/bench/optanessd/imdt.html

Thanks for the update. Its boost to read throughput looks amazing.

I can't wait until we see Optane DIMMs come to market. It's no surprise that Optane doesn't perform great as an SSD when every part of the stack is so highly optimized to minimize the impact of SSD limitations.

We'll have test results on Optane DIMMs posted "soon" - as soon as the NDA expires...

Did you use the cache-line flush instructions directly or use some higher-level package from the PMDK for adding byte-addressable NVM support to LMDB?

Nope, nor was any of that necessary. LMDB already works reliably with Optane DIMMs, unmodified. By default, LMDB uses a read-only mmap and write() syscalls, so it's the OS's job to persist the data. If you used the MDB_WRITEMAP option to write through the mmap, then you would indeed need to add explicit cache-line flush instructions. But that's not the recommended way to use LMDB.
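
The general pattern of "read through a read-only map, write through syscalls" can be sketched as follows. This is an illustration on a temporary file, not LMDB's implementation; it assumes a POSIX system with a unified page cache (such as Linux), and the function name is hypothetical.

```python
import mmap
import os
import tempfile


def demo_readonly_map():
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, mmap.PAGESIZE)        # one zero-filled page
        view = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_READ)
        try:
            os.pwrite(fd, b"hello", 0)         # all writes go through the syscall
            os.fsync(fd)                       # persistence is the OS's job
            seen = view[:5]                    # readers see the data via the map
            try:
                view[0:5] = b"oops!"           # stores through the map are refused
                writable = True
            except TypeError:
                writable = False
        finally:
            view.close()
        return seen, writable
    finally:
        os.close(fd)
        os.unlink(path)


seen, writable = demo_readonly_map()
print(seen, writable)
```

Because the map is read-only, a stray store cannot corrupt the file, and the application never has to flush CPU caches itself since every modification passes through the kernel.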
