
You're right, I was talking about LevelDB. But in your benchmark MDBM is 2x faster on the code path it cares about, random reads. Lots of people pay attention to a factor of 2.

Any idea why it is that much better than LMDB?

I actually agree with you that hashes sort of suck. Just look at Git when the repo gets big: a hash is a miserable way to traverse all that data, very cache unfriendly.

But the use cases for MDBM are the same: you have lots of keys and you want to get to any key very quickly. It appears to me that it (still, 20 years later) wins that race.



> Any idea why it is that much better than LMDB?

Yes, for exactly the reasons you'd expect:

  mdb_stat /tmp/leveldbtest-1000/dbbench_mdb-6/
  Status of Main DB
    Tree depth: 4
    Branch pages: 204
    Leaf pages: 31250
    Overflow pages: 0
    Entries: 1000000
As you said, MDBM can find any record in 2 seeks; for this database the LMDB tree height is 4 so any random access takes 4 seeks. 2x perf difference.
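
Those stats come from LMDB's mdb_env_stat() call, by the way. A minimal sketch (the env path on the command line is just a placeholder) that prints the same fields as the mdb_stat tool:

  #include <stdio.h>
  #include <lmdb.h>

  int main(int argc, char **argv)
  {
      MDB_env *env;
      MDB_stat st;
      int rc;

      if (argc < 2) { fprintf(stderr, "usage: %s <env dir>\n", argv[0]); return 1; }
      if ((rc = mdb_env_create(&env)) != 0) return 1;
      /* Read-only open of an existing environment, e.g. the dbbench dir above */
      if ((rc = mdb_env_open(env, argv[1], MDB_RDONLY, 0664)) != 0) {
          fprintf(stderr, "mdb_env_open: %s\n", mdb_strerror(rc));
          return 1;
      }
      if ((rc = mdb_env_stat(env, &st)) == 0) {
          printf("Tree depth:     %u\n",  st.ms_depth);
          printf("Branch pages:   %zu\n", st.ms_branch_pages);
          printf("Leaf pages:     %zu\n", st.ms_leaf_pages);
          printf("Overflow pages: %zu\n", st.ms_overflow_pages);
          printf("Entries:        %zu\n", st.ms_entries);
      }
      mdb_env_close(env);
      return rc;
  }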

On a larger DB we would expect MDBM's random read perf advantage to get larger as well, until the DB exceeds the size of RAM. I've tried to duplicate my http://symas.com/mdb/ondisk/ tests with MDBM but it makes XFS lose its mind by the time the DB gets to 2x the size of RAM. First I had to increase the MDBM page size from my default of 4KB to 128KB, otherwise I'd see a lot of this in the output:

  2014/12/16-13:13:29 ... thread 0: (200000,12200000) ops and (3436.0,8072.2) ops/second in (58.206750,1511.358831) seconds
  3:54903002:dc057:00864 mdbm.c:1809 MDBM cannot grow to 33554432 pages, max=16777216
but on this VM with 32GB RAM, every time the DB hit 60GB in size the kernel log would start getting spammed with

  Dec 16 22:19:19 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:21 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:23 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

So thus far I've been unable to load the 160GB DB to reproduce the test. But this underscores the basically uncacheable nature of hashes - once the directory gets too big to fit in RAM, your "2 seeks per access" goes out the window because the kernel is thrashing itself trying to keep the whole directory in-memory while finding the requested data pages.

B+tree performance degrades gracefully as data volumes increase. Hash performance falls off a cliff once you cross the in-memory threshold.


So for creating the DB you are correct, though you should be able to size the DB so that you aren't splitting and copying all the time, if you have an idea of how much data you are going to put in.

The "2 seeks per access" absolutely does NOT go out the window. The whole point was that this works for any size DB. The directory size is a function of the page size, the keys, and the values: for a run with key+val @ 20 bytes and 8K pages, 25 million entries had a directory of 16KB. You hash the key, walk however much of the 16KB you need to find the page, and then you go to that page and only that page.

2 seeks for any lookup, for any size DB. The directory layout could be better, but you walk it sequentially (hell, map it and lock it in memory, then it is 1 seek for any lookup).
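
To make that concrete, here is a rough sketch of the access pattern being described (this is not MDBM's actual directory format or code, just an illustration of "hash the key, consult a tiny directory, touch exactly one data page"):

  #include <stdint.h>
  #include <stddef.h>

  /* Conceptual sketch only. A tiny directory (e.g. 16KB for 25M entries)
   * maps hash buckets to page numbers in an mmap'd data file. */
  typedef struct {
      uint32_t *page_of_bucket;  /* the directory; small enough to stay cached */
      size_t    nbuckets;        /* power of two */
      size_t    page_size;       /* e.g. 8K */
      char     *map;             /* mmap'd data file */
  } hashdb;

  /* Any decent hash will do; 64-bit FNV-1a shown here. */
  static uint64_t hash_key(const void *key, size_t len)
  {
      const unsigned char *p = key;
      uint64_t h = 0xcbf29ce484222325ULL;
      while (len--) { h ^= *p++; h *= 0x100000001b3ULL; }
      return h;
  }

  static char *lookup_page(hashdb *db, const void *key, size_t len)
  {
      uint64_t h = hash_key(key, len);
      /* seek 1: the directory (or 0 seeks if it is mapped and locked in RAM) */
      uint32_t pageno = db->page_of_bucket[h & (db->nbuckets - 1)];
      /* seek 2: the single data page that can hold this key */
      return db->map + (size_t)pageno * db->page_size;
  }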

The reason you are thrashing XFS so much is that the load is growing the directory and the number of pages. Each time a page splits, you have to copy its entries to the new pages.


For completeness' sake... I've tried a dozen runs at this already, looking for tuning parameters to tweak. It always starts well and then degrades rapidly before finally hanging.

  MDBM:       version 4.11.1
  Date:       Tue Dec 16 20:34:36 2014
  CPU:        16 * Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz
  CPUCache:   20480 KB
  Keys:       16 bytes each
  Values:     2000 bytes each (1000 bytes after compression)
  Entries:    76800000
  RawSize:    147656.2 MB (estimated)
  FileSize:   74414.1 MB (estimated)
  ------------------------------------------------
  2014/12/16-20:34:38 ... thread 0: (200000,200000) ops and (128923.6,128923.6) ops/second in (1.551306,1.551306) seconds
  2014/12/16-20:34:39 ... thread 0: (200000,400000) ops and (110571.0,119044.1) ops/second in (1.808792,3.360098) seconds
  2014/12/16-20:34:42 ... thread 0: (200000,600000) ops and (87921.9,106480.3) ops/second in (2.274746,5.634844) seconds
  2014/12/16-20:34:44 ... thread 0: (200000,800000) ops and (107459.0,106723.3) ops/second in (1.861175,7.496019) seconds
  2014/12/16-20:34:46 ... thread 0: (200000,1000000) ops and (90923.1,103138.7) ops/second in (2.199660,9.695679) seconds
  2014/12/16-20:34:50 ... thread 0: (200000,1200000) ops and (51852.2,88542.6) ops/second in (3.857118,13.552797) seconds
  2014/12/16-20:34:53 ... thread 0: (200000,1400000) ops and (57516.7,82207.6) ops/second in (3.477253,17.030050) seconds
... by the end

  2014/12/16-21:18:19 ... thread 0: (200000,22400000) ops and (858.0,8541.6) ops/second in (233.094661,2622.447456) seconds
  2014/12/16-21:19:19 ... thread 0: (200000,22600000) ops and (3321.7,8424.5) ops/second in (60.210362,2682.657818) seconds
  2014/12/16-21:20:28 ... thread 0: (200000,22800000) ops and (2880.3,8284.6) ops/second in (69.437524,2752.095342) seconds
  2014/12/16-21:21:31 ... thread 0: (200000,23000000) ops and (3203.6,8171.9) ops/second in (62.429181,2814.524523) seconds
then a stream of these starts showing up in dmesg

  Dec 16 21:14:09 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 21:14:30 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 21:14:32 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 21:14:34 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 21:14:36 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
and on and on until I kill the job.

  Dec 16 22:19:01 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:03 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:05 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:07 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:09 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:11 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:13 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:15 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:17 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:19 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:21 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 16 22:19:23 localhost kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  Dec 17 00:14:16 localhost systemd: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 1372 (vmtoolsd)
LMDB does this load in 8 minutes at an effective data rate of ~283MB/sec. Pretty much as fast as the hardware will stream (peak throughput of 300MB/sec in this VM). I don't see how you can possibly use MDBM as an unreliable (rebuild-after-a-crash) data store if it could potentially take hours to reload the data.


Just to be clear - the loading job keeps running when the kernel error messages first start happening at 21:14:09. But after 21:21:31 the loading job makes no more progress. The kernel error messages continued until 22:19:23 when I killed the job, but nothing further got written to the DB in that intervening hour (based on looking at the filesize, which admittedly may not tell the whole story). The job stats were set to output once per 200,000 ops. It's of course possible that it was still making some progress, but dropping from 3203 ops/second to something less than 55/second is pretty drastic, and at that rate it would have taken weeks to complete the load.

As another note, I also tried calling mdbm_pre_split() at the beginning of the job. That churned for 10 minutes before adding any records, then started processing at its normal speed, and then quickly degraded into the same state.
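
For reference, the tuning knobs involved look roughly like this. A minimal sketch against the v4 mdbm_open()/mdbm_pre_split() calls; the path and the pre-split page count are hypothetical placeholders, not recommendations, and the flag names are worth checking against your mdbm.h:

  #include <mdbm.h>
  #include <stdio.h>

  int main(void)
  {
      /* 128KB pages instead of the 4KB default, as in the runs above. */
      MDBM *db = mdbm_open("/tmp/dbbench.mdbm",
                           MDBM_O_RDWR | MDBM_O_CREAT,
                           0644,
                           128 * 1024,  /* page size */
                           0);          /* initial DB size: let it grow */
      if (!db) { perror("mdbm_open"); return 1; }

      /* Pre-split to a target page count before inserting any records, so
       * the bulk load isn't splitting and copying pages as it goes.
       * 1 << 20 pages is a made-up target for illustration only. */
      if (mdbm_pre_split(db, 1 << 20) != 0)
          perror("mdbm_pre_split");

      mdbm_close(db);
      return 0;
  }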


Did you get an idea as to the cause of the problem? I bet that'd be of interest to either the mdbm maintainers or the kernel developers.


I apologize. I reread my comments and they come off as downright incendiary which was not what I intended.

At least back when I was using it, there was a bit of a science to tuning mdbm correctly for your data. In general, it is a tool designed to let you tweak it with whatever application-level knowledge you have of your data's structure, and as a consequence it can be terribly suboptimal out of the box. Depending on circumstances, even if you don't need disk-level transactional integrity, LMDB may indeed be a better choice.

That said, the test case you are describing is definitely a use case for which mdbm is suboptimal. The results you are getting are if anything surprisingly good under the circumstances, and frankly I'd never have even considered using mdbm for that kind of workload (LMDB would definitely be one of the first choices I'd consider for it).

As you've mentioned, a hashtable-based key-value store is fairly suboptimal for disk-based storage (though depending on circumstance and how you tweak it, a hash table with mdbm can work surprisingly well with an SSD-based store), and the numbers you are presenting seem if anything better than I'd expect.

The reports you are seeing from XFS seem... odd, and feel like either a bug in how mdbm is talking to the mmap/vfs layer or, more likely, one within XFS's implementation, but it doesn't seem like they are slowing you down.

In general, mdbm is most useful for storing and accessing compact rows in memory across potentially many processes with a random access pattern, which IMHO is not the problem space that you are testing and tackling with LMDB, LevelDB, or many of the others. One can (rightly) argue that that is a fairly narrow and simple problem space, but as with almost anything in computer science, doing even fairly narrow and simple things efficiently (and mdbm has a number of clever design choices that help it be efficient) and in an error free fashion is enough trouble that having a standard tool for solving that problem is terribly useful.
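
For readers who haven't used it, the heart of that usage pattern is just mdbm_open() plus datum-based store/fetch under a lock, shared by however many processes map the same file. A minimal sketch; the path, flags, and sizes here are assumptions to check against your mdbm.h rather than a definitive recipe:

  #include <mdbm.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* Any number of processes can open the same file and share its pages. */
      MDBM *db = mdbm_open("/tmp/shared.mdbm", MDBM_O_RDWR | MDBM_O_CREAT,
                           0644, 4096 /* page size */, 0 /* default presize */);
      if (!db) { perror("mdbm_open"); return 1; }

      datum key = { "session:42", (int)strlen("session:42") };
      datum val = { "some-compact-row", (int)strlen("some-compact-row") };

      mdbm_lock(db);                          /* lock shared across processes */
      mdbm_store(db, key, val, MDBM_REPLACE); /* insert or overwrite */
      datum got = mdbm_fetch(db, key);        /* dptr points into the mapped page */
      if (got.dptr)
          printf("%.*s\n", got.dsize, got.dptr);
      mdbm_unlock(db);

      mdbm_close(db);
      return 0;
  }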

MDBM isn't really the only tool for solving that problem out there. Pretty much any shared memory hash table based solution may be a good fit for it, and there are alternative data structures that have desirable advantages over hash tables (critbit based structures are one of my favourite pets for such problems). Heck, in C++ I've used the Boost.IPC library's unordered maps for the job with reasonable results.

There are probably some implementations out there that perform better than mdbm in certain cases (I'd actually be interested in benchmarks comparing that kind of workload with other tools designed for that problem space). Still, the mdbm codebase is battle-hardened and really does perform well as long as your data set size and access patterns don't cause thrashing of the page store.



