As we don’t use RocksDB but LMDB, we use a lot less real memory than key-value stores that use a user-side cache system. LMDB is memory-mapped and therefore lets the OS manage memory for it. Typesense uses RocksDB, and Elasticsearch uses a custom key-value store used internally by Lucene.
The real advantage of LMDB is that it is a B-tree: key-value pairs are stored in order and need no extra computation when retrieved. That is not the case for an LSM-tree key-value store like RocksDB, which may have to merge/compact pages of key-value pairs before it can return them to you, wasting CPU that the search engine needs for its union/intersection work…
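To make the contrast concrete, here is a purely illustrative sketch (invented names, nothing like RocksDB's real API) of why an LSM-tree point read may have to consult the memtable and then every sorted run until it finds the key, while a B-tree lookup lands directly on one ordered page:

```python
import bisect

# Toy LSM store: a mutable memtable plus immutable sorted runs
# (newest first). This is a sketch, not RocksDB's actual design.
class TinyLsm:
    def __init__(self):
        self.memtable = {}   # most recent writes
        self.runs = []       # sorted (key, value) lists, newest first

    def flush(self):
        # Freeze the memtable into a new sorted run ("SST file").
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # A read may touch the memtable AND every run until a hit.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]   # the newest run wins
        return None

db = TinyLsm()
db.memtable["a"] = 1
db.flush()
db.memtable["a"] = 2   # newer version shadows the flushed one
print(db.get("a"))     # -> 2
```

A B-tree lookup, by contrast, is one descent through ordered pages: there is exactly one place the key can live, so nothing has to be merged on the way out.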
Another advantage of LMDB is that it returns a view into the database itself for each entry. RocksDB can’t, as it must do work on the entries before returning them to the library user, for example decompressing or merging the values.
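The "view into the DB" point can be illustrated with Python's `mmap`: a `memoryview` slice of a mapped file references the mapped bytes without copying them, which is roughly what LMDB does when it hands you a pointer into its map (a sketch of the idea only, not LMDB's actual API):

```python
import mmap
import os
import tempfile

# Write a small "database" file, then map it read-only.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"hello-lmdb-style-value")

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The slice below is a view of the mapping: no bytes are copied
    # until we explicitly materialize them with bytes().
    view = memoryview(mapped)[6:10]
    print(bytes(view))   # -> b'lmdb'
    view.release()
    mapped.close()
```

An LSM store generally cannot offer this, because the value it returns may have to be assembled (decompressed, or reconciled across levels) rather than read in place.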
LMDB and RocksDB are both great projects, but how much performance you can get out of them depends on the larger architecture of the system they are integrated into. Both projects provide dozens of flags to tune read/write amplification for a particular use case. You will often see public benchmarks of these systems being updated with suggestions from both sides about less-than-optimal configurations being used!
In the case of Typesense, RocksDB is not even a top-10 contributor to the overall latency involved in serving the result. In any case, it would be good to clarify a few things:
> As we don’t use RocksDB but LMDB, we use a lot less real memory than key-value stores that use a user-side cache system.
Typesense stores only the raw data in RocksDB. All indexing data structures for filtering, faceting, etc. are compact in-memory structures maintained outside it. The only fixed memory cost from RocksDB is an in-memory table used to buffer writes (see the next point) before they are flushed to disk. In practice, this is a trivial fraction of memory compared to the other data structures.
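That fixed-cost write buffer can be sketched as follows: writes accumulate in memory and are flushed as a batch once a size budget is hit, so the in-memory cost stays bounded regardless of total data volume (all names and the threshold here are invented for illustration):

```python
# Toy write buffer with a fixed entry budget, standing in for an
# LSM memtable. Not RocksDB's real API or tuning knobs.
class WriteBuffer:
    def __init__(self, max_entries=4):
        self.max_entries = max_entries
        self.pending = {}            # bounded in-memory buffer
        self.flushed_batches = []    # stands in for on-disk SST files

    def put(self, key, value):
        self.pending[key] = value
        if len(self.pending) >= self.max_entries:
            # Flush the whole batch "to disk" and reset the buffer,
            # keeping the memory cost fixed.
            self.flushed_batches.append(sorted(self.pending.items()))
            self.pending = {}

buf = WriteBuffer(max_entries=2)
buf.put("a", 1)
buf.put("b", 2)   # hits the budget: one batch is written out
buf.put("c", 3)
print(len(buf.flushed_batches), len(buf.pending))   # -> 1 1
```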
> LSM-Tree key-value store like RocksDB that needs to merge/compact pages of key-values pairs before being able to return it to you
Writes happen in memory and are flushed to disk in batches. Merging of on-disk SST files happens in the background, with no real impact on reads. The advantage of this approach, though, is that it gets you really good batched write throughput [0] (the above caveat about the difficulty of benchmarking applies).
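A background compaction step amounts to merging several sorted runs into one, keeping only the newest version of each key. A minimal sketch of that merge (runs ordered newest-first, mirroring the read path; names are invented):

```python
# Toy compaction: merge sorted runs into a single run, newest
# version of each key winning. Not RocksDB's real compaction code.
def compact(runs):
    merged = {}
    # Apply the oldest run first so newer runs overwrite older values.
    for run in reversed(runs):
        for key, value in run:
            merged[key] = value
    return sorted(merged.items())

runs = [
    [("a", 2), ("c", 9)],   # newest run
    [("a", 1), ("b", 5)],   # older run
]
print(compact(runs))        # -> [('a', 2), ('b', 5), ('c', 9)]
```

After compaction, subsequent reads have fewer runs to consult, which is why doing this in the background keeps the read path cheap.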
In summary, like all systems, choosing a storage system involves many trade-offs and what really matters is what works best for your architecture.
A previous version of MeiliSearch used RocksDB, and we had a lot of trouble with it: a lot of setup to make sure we were not killed by the OS due to OOM, and even fixing a lot of strange segfaults by patching the RocksDB library itself...
RocksDB doesn't support transactions, only views into the database. That means that if your program is indexing, writing into the database, and some event makes it stop unexpectedly, you can't just restart it and use the data as-is, because it could be corrupted.
This is why at MeiliSearch we prefer using LMDB: even after an unexpected crash, a reboot is instant and the data is valid. You just restart the indexing you were previously doing, and in the meantime you can serve requests to users with the previous version of the database.
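The instant-and-valid recovery property comes from a copy-on-write design: a write transaction builds new pages off to the side and only becomes visible when a single root reference is swapped, so a crash mid-write leaves the previous, fully consistent root in place. A toy simulation of that idea (invented names, nothing like LMDB's real API):

```python
# Toy copy-on-write store: commits publish a new version by swapping
# one root reference. A crash before the swap changes nothing visible.
class CowStore:
    def __init__(self):
        self.root = {}   # the committed, always-valid version

    def commit(self, updates, crash_before_swap=False):
        new_root = dict(self.root)   # copy-on-write: old pages untouched
        new_root.update(updates)
        if crash_before_swap:
            return                   # simulated crash: nothing published
        self.root = new_root         # atomic pointer swap = the commit

store = CowStore()
store.commit({"doc:1": "v1"})
store.commit({"doc:1": "v2"}, crash_before_swap=True)   # crash mid-index
print(store.root["doc:1"])   # -> 'v1' (previous version still served)
```

This is why a restart needs no repair step: whatever root is in place at boot is by construction a complete, consistent snapshot, and you simply redo the interrupted indexing on top of it.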
Also, as you can see [0], the benchmark between LMDB and RocksDB is very clear. I understand that reading the database is maybe not what takes time on your side, but it is on ours, combined with the set operations between sets of internal document ids.
I see. Perhaps because we're using the native C++ client, we faced no such crash issues with RocksDB. We have also handled transactions at a higher layer.
That's a piece of information that is typically missing, yet very important!