But the big issue with the databases I've worked with is not how many inserts you do per second. Even spinning rust, if properly reasoned about, can do -serious- inserts per second into append-only data structures like MyISAM, Redis, even Lucene. The issue comes when you want to read that data or, more horribly, update it. Updates, by definition, are a read and a write to committed data, and that can cause fragmentation and other huge headaches.
I'd love to see someone do 1,000,000 updates/s.
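To make that concrete, here's a minimal sketch (a hypothetical slotted file with fixed 16-byte records, nothing like a real engine) of why the two cases differ: an insert is one sequential append, while an update is a seek, a read, and a rewrite of committed data, and a value that outgrows its slot has to be relocated, leaving behind exactly the kind of hole fragmentation is made of.

    import os

    SLOT = 16  # assumed fixed record size, purely for illustration

    def insert(f, value: bytes) -> int:
        """Append-only insert: one sequential write at the tail (the fast path)."""
        assert len(value) <= SLOT
        f.seek(0, os.SEEK_END)
        slot = f.tell() // SLOT
        f.write(value.ljust(SLOT, b"\x00"))
        return slot

    def update(f, slot: int, value: bytes) -> None:
        """In-place update: seek + read + rewrite of already-committed data."""
        f.seek(slot * SLOT)
        old = f.read(SLOT)                  # the read half of read-modify-write
        assert len(old) == SLOT, "no such record"
        if len(value) > SLOT:
            # A grown value can't be overwritten in place; a real engine would
            # relocate it or chain overflow pages, and the dead slot left
            # behind is the fragmentation headache.
            raise ValueError("record outgrew its slot")
        f.seek(slot * SLOT)
        f.write(value.ljust(SLOT, b"\x00"))

    with open("records.db", "wb+") as f:
        s = insert(f, b"hello")
        update(f, s, b"hello again?")       # fits: cheap in-place overwrite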
Two amazing facts about MySQL Cluster:
(1) It's open source
(2) It's never penetrated the Silicon Valley Echo Chamber, but is still the world's best DB for write-intensive transactional workloads.
I feel like it would be more popular if people actually wrote about how they're using it. But how do you start when even basic configuration bugs are still open without progress: https://bugs.mysql.com/bug.php?id=28292
(this was a few years ago or so, maybe things changed)
But if you really want to squeeze performance out of it, read Mikael's blog:
and Frazer's blog:
That forces latest-version resolution into a read-side problem. Systems like that cannot handle any sort of complete table scan; they only really work for a pre-known key.
With a known unique key it is somewhat feasible to use systems like this, but it gets extremely expensive when what you need is BETWEEN '2017-01-01' AND '2017-01-02'.
With Cassandra (well, most NoSQL DBs that I've run into) you model your data in the DB around your queries. So if you want to scan by date, you create a new table (or a materialized view) with the date as the key you query by. Though even then you have to keep the data volume in mind, to make sure you don't end up with too much data per partition, or with hotspots.
If you use a clustering key, scans become trivial within a given partition.
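Both points are easy to see in CQL; here's a minimal sketch using Python and the DataStax cassandra-driver (keyspace, table, and column names are all made up). The base table only answers known-key lookups; the query-shaped table partitions by day, which bounds partition size, and clusters by timestamp, so a date range like the BETWEEN above becomes an ordered in-partition scan, one partition per day.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical keyspace

    # Base table: answers lookups by a known unique key, nothing else.
    session.execute("""
        CREATE TABLE IF NOT EXISTS events (
            event_id   uuid PRIMARY KEY,
            created_at timestamp,
            payload    text
        )""")

    # Query table for date scans: partition key = day (bounds partition size,
    # spreads hotspots), clustering key = created_at (orders rows on disk).
    session.execute("""
        CREATE TABLE IF NOT EXISTS events_by_day (
            day        date,
            created_at timestamp,
            event_id   uuid,
            payload    text,
            PRIMARY KEY ((day), created_at, event_id)
        )""")

    # The BETWEEN '2017-01-01' AND '2017-01-02' query becomes one ordered
    # scan inside the 2017-01-01 partition:
    rows = session.execute(
        "SELECT * FROM events_by_day WHERE day = '2017-01-01' "
        "AND created_at >= '2017-01-01' AND created_at < '2017-01-02'")

(The application writes every event to both tables, or lets a materialized view maintain the second one.)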
The Lightning Memory-Mapped Database (LMDB) https://www.youtube.com/watch?v=Rx1-in-a1Xc
What is impressive, I guess, is that they can sustain 1M updates/sec with a 99.9th-percentile update round-trip of roughly 10 microseconds, and medians around 4 microseconds.
Our graph store (HBase, SSD) on 10 nodes can easily support 3M edges/s read/stored, but that's ~40k RPCs/s given our column sizes and average batch size.
A log structure for your database would make the update case more similar to the append case, wouldn't it?
(There are definite limits to that technique, but it does work for eg some file systems---which are also a type of database.)
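A minimal sketch of that idea (hypothetical, roughly in the spirit of Bitcask-style stores): every write, updates included, is an append to the log; an in-memory index points each key at its latest record, so latest-version resolution happens at read time and superseded records become garbage for a later compaction pass.

    import os

    class LogStore:
        def __init__(self, path):
            self.log = open(path, "ab+")
            self.index = {}                 # key -> byte offset of latest record

        def put(self, key: str, value: str) -> None:
            """An update is just another append plus an index repoint.
            (Keys and values are assumed tab/newline free in this toy format.)"""
            self.log.seek(0, os.SEEK_END)
            offset = self.log.tell()
            self.log.write(f"{key}\t{value}\n".encode())
            self.log.flush()
            self.index[key] = offset        # the older record is now dead space

        def get(self, key: str):
            """Read-side resolution: follow the index to the newest version."""
            offset = self.index.get(key)
            if offset is None:
                return None
            self.log.seek(offset)
            _, value = self.log.readline().decode().rstrip("\n").split("\t", 1)
            return value

    store = LogStore("kv.log")
    store.put("a", "1")
    store.put("a", "2")                     # the "update": a sequential append
    print(store.get("a"))                   # -> 2

The limits mentioned above show up as compaction: without a pass that rewrites live records and drops dead ones, the log grows without bound.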
> but it does work for eg some file systems---which are also a type of database
No mainstream FS uses the txn log as a primary store. The log(s) are only around for in-flight metadata writes; as soon as the second write (to the final on-disk location) is out, that spot in the log can and will be reused. As in the cleanup case, FSes try to batch as many ops as fit into their log(s) before flushing the transaction.
Apart from the CoW herd (ZFS, btrfs, HAMMER), most FSes either can't or won't journal data by default anyway. That's a fairly often overlooked point: many people seem to assume that a journaling FS means that everything, including their application data, is journaled. That isn't just wrong, it can also be a rather dangerous assumption, depending on the application.
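The practical upshot, as a small sketch (POSIX semantics assumed, names illustrative): if your application needs its own data on disk, a journaling FS won't do it for you; you have to flush and fsync the file yourself, and fsync the directory if you depend on the rename being durable.

    import os

    def durable_write(path: str, data: bytes) -> None:
        """Write-temp / fsync / rename: durability the journal does NOT give you."""
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force the *data* itself to stable storage
        os.rename(tmp, path)       # atomic replace (a metadata op journals do cover)
        # fsync the directory so the rename survives a crash as well
        dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)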