There are basically two use cases for LSM trees vs. B+trees:
- Either you have a lot of random new writes and not so many updates, in which case an LSM tree will ingest new data as fast as the disk can store it
- Or you care more about read performance, in which case LMDB and BoltDB will give you more predictable (and arguably better) performance
InfluxDB is a timeseries database, so they write a lot of stuff and don't even read all of it. As data gets older it can be pruned efficiently with an LSM-based design, but not so easily with a B+tree. Dgraph, on the other hand, seems to sit right in the middle, since it wants to be a general-purpose database, so there's no easy winner here. Hopefully the choice was correct for most use cases.
They're based on different technologies. Bolt uses B+trees, while Badger, RocksDB, and LevelDB use LSM trees.
Bolt also uses insane amounts of RAM, and writes get slower and slower as the database grows. (Personal experience, I don't have benchmarks, so take this with a grain of salt.)
Thoughts on implementing transactions on top of Badger? I can see a bunch of ways to do it, but it seems you'd have to build an entire high-level transaction model to get isolation, read consistency, and atomicity. You'd also want to do a lot of caching, I think, to make it fast.
If huge, long-running transactions aren't needed, then it may be easier to just let transactions happen in RAM, with a cache to enforce isolation and emulate atomicity. So for example, while a transaction is committing, the system would need to cache the previous (that is, currently committed) version of every key, so that other transactions don't read the on-disk data while it is in the process of being updated. It would also need either a redo log or an undo log so that, in the event of a crash, any half-written transactions can be either replayed or rolled back.
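Here's a minimal sketch of that idea in Go, assuming a hypothetical generic `Store` interface (not Badger's actual API): writes are buffered in RAM for isolation, and commit captures an undo log of the previous values before touching the store.

```go
package txn

import "sync"

// Store is a hypothetical KV interface; Badger's real API differs.
type Store interface {
	Get(key []byte) ([]byte, error)
	Set(key, value []byte) error
}

// Txn buffers writes in RAM so other readers never see half-committed state.
type Txn struct {
	store   Store
	pending map[string][]byte // uncommitted writes, visible only to this Txn
}

func Begin(s Store) *Txn {
	return &Txn{store: s, pending: make(map[string][]byte)}
}

// Get serves this transaction's own writes first, then falls back to the store.
func (t *Txn) Get(key []byte) ([]byte, error) {
	if v, ok := t.pending[string(key)]; ok {
		return v, nil
	}
	return t.store.Get(key)
}

// Set stages a write; nothing hits disk until Commit.
func (t *Txn) Set(key, value []byte) {
	t.pending[string(key)] = value
}

var commitMu sync.Mutex // serialize commits to keep the sketch simple

// Commit snapshots the previous value of every key into an undo log before
// writing, so half-written transactions could be rolled back after a crash.
// (Persisting the undo log, and recovery itself, are omitted here.)
func (t *Txn) Commit() error {
	commitMu.Lock()
	defer commitMu.Unlock()

	undo := make(map[string][]byte, len(t.pending))
	for k := range t.pending {
		if old, err := t.store.Get([]byte(k)); err == nil {
			undo[k] = old // previous committed version
		}
	}
	_ = undo // a real system would fsync this log before the writes below

	for k, v := range t.pending {
		if err := t.store.Set([]byte(k), v); err != nil {
			return err // recovery would apply the undo log to roll back
		}
	}
	return nil
}
```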
I'm curious if you have any more details here, or comparable benchmarks to share. BoltDB uses B+ trees and you say that a B+ tree approach is worth investigating due to improvements in SSD random write performance, so does BoltDB falsify that hypothesis or do you think it's just not well implemented and the idea still has potential?
We tried BoltDB. Its performance is really bad. It is just badly implemented: it acquires a global mutex lock across all reads and writes. We wouldn't have written Badger if BoltDB had worked for us.
RocksDB performs much better and is probably the most popular and efficient KV store in the market: it's Facebook's fork of Google's LevelDB. That's why the benchmarks are against it. Without spending time generating benchmarks, I'll bet that Badger would beat BoltDB any day.
The idea of using B+trees with a value log has potential. One would need some obvious optimizations to decrease the frequency of writes to disk, because SSDs have to run garbage collection cycles, which can hurt write performance in a long-running workload.
But I think it would make a great research project. And if it turns out to be better than our current LSM tree approach in read-write performance, I'd switch in a heartbeat, because I think a B+tree might be a simpler design.
> We tried BoltDB. Its performance is really bad.
BoltDB author here. I agree with you that Badger would beat Bolt in many performance benchmarks but it's an apples and oranges comparison. LSM tree key/value stores are write-optimized and generally lack transactional support. BoltDB is read-optimized and supports ACID transactions with serializable isolation. Transactions come with a cost but they're really important for most applications.
Regarding benchmarks, LSMs typically excel in random and sequential write performance and do OK with random read performance. LSMs tend to be terrible for sequential scans since levels have to be merged at query time. B+trees are usually terrible at random write performance but can do well with sequential writes that are batched. They tend to have good random read performance and awesome sequential scan performance.
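To make the sequential-scan point concrete: an LSM range scan is effectively a k-way merge across the sorted runs of every level. A simplified in-memory sketch in Go (real engines merge on-disk tables and also resolve duplicate keys and tombstones by level priority):

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor is a position within one sorted level.
type cursor struct {
	keys []string
	pos  int
}

// mergeHeap orders cursors by their current key.
type mergeHeap []*cursor

func (h mergeHeap) Len() int            { return len(h) }
func (h mergeHeap) Less(i, j int) bool  { return h[i].keys[h[i].pos] < h[j].keys[h[j].pos] }
func (h mergeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *mergeHeap) Pop() interface{} {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// scan merges the already-sorted levels into one ordered stream; this is the
// per-query work an LSM iterator does that a B+tree scan avoids.
func scan(levels [][]string) []string {
	h := &mergeHeap{}
	for _, keys := range levels {
		if len(keys) > 0 {
			heap.Push(h, &cursor{keys: keys})
		}
	}
	var out []string
	for h.Len() > 0 {
		c := heap.Pop(h).(*cursor)
		out = append(out, c.keys[c.pos])
		c.pos++
		if c.pos < len(c.keys) {
			heap.Push(h, c)
		}
	}
	return out
}

func main() {
	levels := [][]string{
		{"a", "d", "g"}, // memtable
		{"b", "e"},      // level-0 table
		{"c", "f"},      // level-1 table
	}
	fmt.Println(scan(levels)) // [a b c d e f g]
}
```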
It comes down to using the right tool for the job. If you don't need transactions, Bolt is probably overkill. There's a whole section in the BoltDB README about when not to use Bolt.
> It is just badly implemented: it acquires a global mutex lock across all reads and writes.
You're welcome to your opinion about it being "badly implemented", but the claim of a global mutex lock across all reads and writes is simply untrue. Writes are serialized, so those do take a database-wide lock. Read transactions, however, only briefly take a lock when they start and again when they stop, in order to obtain a snapshot of the root node. That gives the transaction a point-in-time snapshot of the entire database for the length of the transaction without blocking other read or write transactions.
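For reference, this is roughly how that looks with Bolt's API: `db.View` opens a read-only transaction whose snapshot stays consistent even while a concurrent `db.Update` commits (minimal example, error handling trimmed).

```go
package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("demo.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Writes are serialized: only one Update transaction runs at a time.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("users"))
		if err != nil {
			return err
		}
		return b.Put([]byte("alice"), []byte("v1"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// A read transaction pins the root node it started from, so it sees a
	// point-in-time snapshot regardless of concurrent writers.
	err = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("users")).Get([]byte("alice"))
		fmt.Printf("alice = %s\n", v)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```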
Very interesting, thanks for the answer! For the application and data structure I have in mind, I expect lots of range iteration over small values (<32 bytes?) and relatively infrequent writes, so Badger might not be a good fit. At a minimum, it sounds like I'd need to set up my own benchmarks.
For our system database[0] I am using BoltDB. It is a search engine whose database is updated mostly in batches, so write performance is a non-issue. The Go/BoltDB custom index has been running without any issues under moderate load for 2 years, and performance is great (a molecule information page can be delivered to the end user in under 50ms at the 99.9th percentile).
So you should really test your own read/write ratio, with your key and payload sizes, before selecting one solution or another. Even in the Badger benchmarks, you can see that results vary a lot.
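One way to do that is a Go benchmark shaped like your own workload. The sizes and ratio below are placeholder knobs, and `Store` is a stand-in interface to wrap whichever engine you're testing:

```go
package kvbench

import (
	"math/rand"
	"testing"
)

// Store is a stand-in for whichever engine is under test (Bolt, Badger, ...).
type Store interface {
	Get(key []byte) ([]byte, error)
	Set(key, value []byte) error
}

// Tune these knobs to match the real workload before trusting any numbers.
const (
	keySize    = 16
	valueSize  = 32 // e.g. the small values mentioned above
	readsPer10 = 9  // 9 reads for every 1 write
)

func benchmarkMixed(b *testing.B, s Store) {
	key := make([]byte, keySize)
	val := make([]byte, valueSize)
	rand.Read(val)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		rand.Read(key) // random keys; adapt to your real key distribution
		if i%10 < readsPer10 {
			s.Get(key) // misses (not-found errors) are expected and ignored
		} else {
			s.Set(key, val)
		}
	}
}
```

Each engine then needs only a one-line wrapper, e.g. a `BenchmarkBolt` that opens a Bolt-backed `Store` and calls `benchmarkMixed`.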