
I hate to be a hater.

But the big issue with databases I've worked with is not how many inserts you can do per second. Even spinning rust, if properly reasoned about, can do -serious- inserts per second with append-only data structures like MyISAM, Redis, even Lucene. The issue comes when you want to read that data or, more horribly, update that data. An update is, by definition, a read plus a write to already-committed data, and this can cause fragmentation and other huge headaches.

I'd love to see someone do updates 1,000,000/s

MySQL Cluster (NDB - not InnoDB, not MyISAM. NDB is the network database engine, built by Ericsson and taken on by MySQL. It is a distributed, in-memory, shared-nothing DB). Over 50% of mobile calls use NDB as a Home Location Register, and it can handle >1M transactional writes per second. We have verified the benchmarks: you can get millions of transactions/sec (updates are about 50% slower than reads) on commodity hardware with just gigabit Ethernet. Oracle claims up to 200M transactions/sec on InfiniBand and good hardware: http://www.slideshare.net/frazerClement/200-million-qps-on-c...

Two amazing facts about MySQL Cluster: (1) It's open source (2) It's never penetrated the Silicon Valley Echo Chamber, but is still the world's best DB for write-intensive transactional workloads.

Another reason may be that you seem to need to be an expert in the internal workings of NDB to deal with it. I tried to use it. Over a month I ended up with a few cases of "can't start the database", "can't update", etc., with no real solutions. There's not much information about NDB on the internet, so googling doesn't help. The best help I got was from the mysql-ndb IRC channel, but when things went bad, people said something like "if you send me the database, I'll help you figure it out". That doesn't work in practice.

I feel like it would be more popular if people actually wrote about how they're using it. But how do you start when even basic configuration bugs are still open without progress: https://bugs.mysql.com/bug.php?id=28292

(this was a few years ago or so, maybe things changed)

Yes, that's a good few years ago. Stability issues were mostly fixed around 8 years ago; it's now very stable. There are products like www.severalnines.com and MySQL Cluster Manager to set up and manage instances for you with a UI. If you want to roll your own, there are Chef cookbooks for installing NDB - https://github.com/hopshadoop/ndb-chef .

But if you really want to squeeze performance out of it, read Mikael's blog: http://mikaelronstrom.blogspot.se/ and Frazer's blog: http://messagepassing.blogspot.se/

Some systems (like Cassandra) have upserts and can do writes without reading. Though you lose any kind of "transaction" safety, in that you are not sure what you are writing on top of. But in my experience, for the vast majority of cases, that is OK.

> Some systems (like Cassandra) have upserts and can do writes without reading.

That forces latest-version resolution into a read-side problem - systems like that cannot handle any sort of complete table scan and only really work for a pre-known key.

With a known unique key it is somewhat possible to use systems like this, but it gets extremely expensive when what you need is BETWEEN '2017-01-01' AND '2017-01-02'.
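A toy sketch of what "write without reading" plus read-side resolution looks like, in Python. This is a hypothetical simplification loosely modeled on last-write-wins timestamps, not Cassandra's actual implementation: writers never read first, and the reader is the one who picks the latest version.

```python
# Hypothetical sketch of blind upserts with read-side (last-write-wins)
# resolution. Writers append a timestamped version without reading;
# the read path resolves which version is current.

class ToyLWWStore:
    def __init__(self):
        self.versions = {}   # key -> list of (timestamp, value) versions

    def upsert(self, key, value, ts):
        """Write without reading: just record another version."""
        self.versions.setdefault(key, []).append((ts, value))

    def get(self, key):
        """Latest-version resolution happens here, on the read path."""
        versions = self.versions.get(key)
        return max(versions)[1] if versions else None

s = ToyLWWStore()
s.upsert("k", "old", ts=1)
s.upsert("k", "new", ts=2)   # no read-modify-write; no idea what it overwrote
print(s.get("k"))            # the reader picks the highest timestamp
```

Note how the write path never touches existing data, which is exactly why the cost shows up on reads and range queries instead.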

We have a lot of data like that and just use a compound partition key of a (year, month, day) tuple, so we can ask for stuff at day resolution. We also add the exact timestamp to the clustering key so you can ask for more precise data.

With Cassandra (well, most NoSQL DBs that I've run into) you model your data in the DB by the queries. So if you want to scan by date, you create a new table (or a materialized view) with the date as the key you query for. Even then you have to keep the amount of data in mind, to make sure you don't end up with too much data per partition, or hotspots.

Cassandra has two components to its primary key: one that sets the partition (which determines which nodes get the data), and one that sets the clustering/sorting/ordering within that partition.

If you use a clustering key, scans become trivial within a given partition.
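A minimal in-memory sketch of that two-part key, in Python. This is a toy stand-in (the table name and example data are invented, and it assumes string clustering keys); real Cassandra keeps each partition sorted on disk per replica, but the shape of the query is the same.

```python
# Toy model of a Cassandra-style primary key: the partition key picks
# the partition (and thus which nodes hold the data); the clustering
# key keeps rows sorted within that partition.
import bisect
from collections import defaultdict

class ToyPartitionedTable:
    def __init__(self):
        # partition key -> sorted list of (clustering_key, value) rows
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        bisect.insort(self.partitions[partition_key], (clustering_key, value))

    def scan(self, partition_key, lo, hi):
        """Half-open range scan [lo, hi) on the clustering key --
        cheap because it touches one already-sorted partition."""
        rows = self.partitions[partition_key]
        return rows[bisect.bisect_left(rows, (lo,)):bisect.bisect_left(rows, (hi,))]

# Partition by a (year, month, day) tuple, cluster by exact timestamp:
table = ToyPartitionedTable()
table.insert((2017, 1, 1), "2017-01-01T09:00", "order-1")
table.insert((2017, 1, 1), "2017-01-01T12:30", "order-2")
table.insert((2017, 1, 2), "2017-01-02T08:15", "order-3")

# A day-resolution query hits exactly one partition, in clustering order:
print(table.scan((2017, 1, 1), "2017-01-01", "2017-01-02"))
```

The BETWEEN-across-days case from upthread is exactly the query this layout makes cheap, as long as the range stays within one partition.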

Depends on your structure. You could scan through dates as a partition's clustering key, for example.

You are so right. Without giving too much detail, my wife is working on a project with a branch of government. They use Oracle. Her team's biggest problems are, and have been, the DB falling over during updates or joins. We're talking billions of rows of government data: running out of table space, loads taking literally days to complete. Things I have never seen happen. I'm not sure if it's the DB architecture itself or clueless people doing really dumb things; I only get to see these failures second-hand.

If they upgrade to all in-memory databases their full-table scans will run faster


See LMDB (embedded) and ScyllaDB (distributed):

The Lightning Memory-Mapped Database (LMDB) https://www.youtube.com/watch?v=Rx1-in-a1Xc

ScyllaDB http://www.scylladb.com/

Also Apache Geode (distributed, in-memory, consistent, durable, crazy fast) -- seriously solid tech.


Also SnappyData, which uses code from Geode.

Not really a database per se, but trading system matching engines can sustain updates in excess of 1M/s. Of course these are small ~60-byte messages updating 10-15 bytes of a given order book in something like an RB-tree, so it's not as impressive.

What is impressive, I guess, is that they can sustain 1M updates/sec with a 99.9th-percentile update round-trip of ~10 microseconds, and medians around ~4 microseconds.
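To illustrate why those updates are so cheap: each message only adjusts the aggregate size at one price level of the book. A hypothetical Python sketch (real engines use an RB-tree or flat array of levels and fixed-size binary messages, not a dict):

```python
# Toy stand-in for a matching engine's order book update path: one small
# message bumps one price level. Prices are in ticks, sizes are shares.

class ToyBook:
    def __init__(self):
        self.bids = {}   # price (in ticks) -> aggregate size at that level
        self.asks = {}

    def apply(self, side, price, size_delta):
        """One ~60-byte message: bump one level's size, drop empty levels."""
        levels = self.bids if side == "bid" else self.asks
        new_size = levels.get(price, 0) + size_delta
        if new_size <= 0:
            levels.pop(price, None)
        else:
            levels[price] = new_size

    def best_bid(self):
        return max(self.bids) if self.bids else None

book = ToyBook()
book.apply("bid", 10050, 300)   # add 300 @ 100.50
book.apply("bid", 10049, 500)
book.apply("bid", 10050, -300)  # full cancel -- the level disappears
print(book.best_bid())
```

Each update touches a handful of bytes in a tiny structure that lives entirely in cache, which is what makes microsecond round-trips plausible.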

Since each ingest process talked directly to the Accumulo tablet local to it, it really measured loopback+RPC+DFS performance. Knowing how these things usually go, it might have been 100M rows/s but only 100k-1M RPCs/s. It's still quite impressive, but it's important to keep it in perspective. For example, I believe Google's C* 1M writes/s demo also included real network overhead from driver processes. Additionally, that was with the WAL on, vs. this Accumulo run which disabled the WAL.

Our graph store (HBase, SSD) on 10 nodes can easily support 3M edges/s read/stored, but that's ~40k RPCs/s given our column sizes and average batch size.

> However the issue comes when you want to read that data or, more horribly, update that data.

A log structure for your database would make the update case more similar to the append case, wouldn't it?

(There are definite limits to that technique, but it does work for eg some file systems---which are also a type of database.)

A transaction log needs to be cleaned up, which takes time. You might amortize cleanup over many small objects.
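A toy sketch of the idea in Python (hypothetical, not a real storage engine): an update is just an append, reads follow an index to the latest record, and the cleanup cost shows up as a compaction pass that is amortized over many small records.

```python
# Toy log-structured KV store: updates are appends, reads take the
# latest record, and compaction rewrites only the live records.

class ToyLogStore:
    def __init__(self):
        self.log = []      # append-only list of (key, value) records
        self.index = {}    # key -> offset of the latest record for that key

    def put(self, key, value):
        """An update is just an append -- no in-place read-modify-write."""
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        off = self.index.get(key)
        return None if off is None else self.log[off][1]

    def compact(self):
        """Cleanup: drop stale records, keeping one live record per key."""
        live = [(k, self.log[off][1]) for k, off in self.index.items()]
        self.log = []
        self.index = {}
        for k, v in live:
            self.put(k, v)

store = ToyLogStore()
store.put("a", 1)
store.put("a", 2)   # the "update": old record becomes garbage in the log
store.put("b", 3)
store.compact()     # log shrinks from 3 records to 2
print(store.get("a"), len(store.log))
```

This is roughly the trade the parent comments describe: writes stay append-shaped, at the price of a background cleanup pass.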

> but it does work for eg some file systems---which are also a type of database

No mainstream FS uses the txn log as a primary store. The log(s) are only around for active writes to metadata; as soon as the second write is out, that spot in the log can and will be reused. Similar to the cleanup case, FSes try to batch as many ops as fit into their log(s) before flushing the transaction.

Apart from the CoW herd (ZFS, btrfs, HAMMER), most FSes either can't or won't journal data by default anyway. This is a fairly often overlooked point: many people seem to assume that a journaling FS means everything, including their application data, is journaled. That isn't just wrong, it can also be a rather dangerous assumption, depending on the application.

Thanks for giving more background to my vague hunches!

In a key value store? That can be done at 1,000,000/s per core.
