I sorta get that, but it's not like Postgres/MySQL/etc are operational nightmare...

aphyr · on May 14, 2012

it's not like Postgres/MySQL/etc are operational nightmares

Every database is an operational nightmare. Just wait until MySQL segfaults every time it gets more than 4 SSL connections at once, or takes four hours to warm up its inno cache. Or Cassandra compaction locks up a node so hard its peers think its down, turning your cluster into a recursive tire fire of GC. Or a Riak node stops merging bitcask and consumes all your disk in a matter of hours. Or Mongo ...

My point is that every database comes with terrifying failure modes. There is no magic bullet yet. You have to pick the system which fits your data and application.

lwat · on May 14, 2012

You don't seem to have any experience with solid, mature RDBMSs like PostgreSQL or Oracle or SQL Server. MySQL sucks in so many ways and all the other names you mentioned are not RDBMSs. People running large numbers of (for example) MS SQL Servers will tell you that they are ridiculously robust and mature and do NOT randomly crash in the middle of the night. And even when your hardware lets you down your hot failover machine is ready to take over without skipping a beat.

Magic bullet? These systems are the closest you can find in the world of software.

aphyr · on May 14, 2012

<sigh> Or when the Oracle SCN advances too quickly, causing random systems to refuse connections or crash outright. Just because software is large, supported, and mature doesn't mean it is free of serious design flaws.

[Edit:] Don't get me wrong, Oracle's DBs are serious business; an incredible feat of engineering. If I could afford them I'd probably use them more often. But everything in this business is a cost tradeoff--in licenses, in dev time, in risk.

aphyr · on May 15, 2012

Aware Postgres is free: it's an excellent database and I've used it successfully (like most of the other tools I've mentioned.) It also has spectacular failure modes; though the only ones I've personally encountered were around vacuuming.

lwat · on May 15, 2012

So if these incredibly mature and robust systems are an 'operational nightmare' I'd like to hear what software systems ARE up to your operational standards?

How often did you manage to spectacularly fail PostgreSQL to qualify it an 'operational nightmare'?

aphyr · on May 15, 2012

http://news.ycombinator.com/item?id=3974403 (reply threshold is high enough for me to post now)

whatusername · on May 15, 2012

I wonder if NonStop SQL ( http://en.wikipedia.org/wiki/NonStop_SQL ) might count..

lwat · on May 14, 2012

FYI PostgreSQL is free.

willbmoss · on May 14, 2012

I'll agree they are not operational nightmares, but now that we're set up with Riak we can do things that I'm pretty sure your Postgres/MySQL/etc. setup cannot.

1. Add individual nodes to the cluster and have it automatically rebalance over the new nodes.

2. Have it function without intervention in the face of node failure and network partitions.

3. Query any node in the cluster for any data point (even if it doesn't have that data locally).

I'm sure there's other things I'm missing, but the point made by timdoug is the key one. We're at a scale now where it's worth trading up-front application code for reliability and stability down the line.

dolinsky · on May 15, 2012

Overall I enjoyed reading the article and the specific tradeoffs you had to consider when comparing Riak to Mongo for your specific use cases. I'm curious if you had any problems w/r/t the 3 items you highlighted above when using Mongo, as none of those were mentioned in the linked article.

(these points are for MongoDB 2.0+)

1. Adding a new shard in Mongo will cause the data to be automatically rebalanced in the background. No application-level intervention required.

2. Node failure (primary in a RS) is handled without intervention by the rest of the nodes in the cluster. Network partitions can be handled depending on what the value set for 'w' is (in practice it's not an isolated problem with a specific solution).

3. Using mongos abstracts the need to know what node any data is on as each mongos keeps an updated hash of the shard keys. Queries will be sent directly to the node with the requested data.

willbmoss · on May 15, 2012

I agree with you in theory, in practice, my experience has been a bit different. Specifically,

1. Since Mongo has a database (or maybe collection now) level lock, doing rebalancing under a heavy write load is impossible.

3. Mongos creates one thread per connection. This means that if you've got to be very careful about the number of clients you start up at any given time (or in total).

taligent · on May 15, 2012

1. Why would you be rebalancing under heavy write load. Wouldn't it be scheduled for quieter periods ?

2. I was under the impression that almost all Mongo drivers had connection pools.

kermatt · on May 15, 2012

> Why would you be rebalancing under heavy write load. Wouldn't it be scheduled for quieter periods ?

I think the example case is when a node fails during a load period - unplanned vs planned.

Some of the public MongoDB stories come from shops that were forced to rebalance during peak loads because they failed to plan to do so in a maintenance window.

Of course waiting to scale out until _after_ it becomes a need will result in pain (of various degrees) no matter what the data storage platform.

zorkian · on May 15, 2012

Yes, Riak gives you certain flexibilities that make certain things easier. If a node dies, you generally don't have to worry about anything. Stuff Just Works. (Well, in the Riak case, this is true until you realize that you have to do some dancing around the issue that it doesn't automatically re-replicate your data to other nodes and relies instead on read repair. This puts a certain pressure on node loss situations that I find is very similar to traditional RDBMS.)

But of your list, I have done all of these things in a MySQL system and for a comparable 1000 lines of code.

1. We implemented a system that tracks which servers are part of a "role" and the weights assigned to that relationship. When we put in a new server, it would start at a small weight until it warmed up and could fully join the pool. Misbehaving machines were set to weight=0 and received no traffic.

2. Node failure is easy given the above: set weight=0. This assumes a master/slave setup with many slaves. If you lose a master, it's slightly more complicated but you can do slave promotion easily enough: it's well enough understood. (And if you use a Doozer/Zookeeper central config/locking system, all of your users get notified of the change in milliseconds. It's very reliable.)

Network partitions are hard to deal with for most applications more so than most databases. It is worth noting that in Riak, if you partition off nodes, you might not have all data available. Sure, eventual consistency means that you can still write to the nodes and be assured that eventually the data will get through, but this is a very explicit tradeoff you make. "My data may be entirely unavailable for reading, but I can still write something to somewhere." IMO it's a rare application that can continue to run without being able to read real data from the database.

3. In a master/slave MySQL environment you would be reading from the slaves anyway unless your tolerance for data freshness is such that you cannot allow yourself to read slightly stale data. I.e., payment systems for banks or things that would fit better in a master/master environment. Since the slaves are all in sync, broadly speaking, you can read from any of them. (But you should use the weighted random as mentioned in the first point.)

...

Please also note that I am not trying to knock Riak. It's neat, it's cool, it does a great job. It's just a different system with a different set of priorities and tradeoffs that may or may not work in your particular application. :)

But to say that it can do things the others can't is incorrect. Riak requires you to have lots of organizational knowledge about siblings and conflict resolution in addition to the operational knowledge you need. A similar MySQL system requires you to have a different set of knowledge -- how to design schemas, how to handle replication, load balancing, etc.

Is one inherently better than the other? I don't think so. :)

aphyr · on May 15, 2012

Can't reply any deeper, figure this is relevant...

Databases are, in my experience and those of my peers, the aspect of a system most likely to crash and burn. On the surface, it looks like regular old bugs--every project has them. But there's a reason databases are so hard: they're the perfect storm of unreliable systems, bound together with complex and unpredictable code to try to present a coherent abstraction. Specifically:

1. DB's need to do disk IO for persistence, which pits them against one of the slowest, most-likely-to-fail components of modern hardware.

2. They must maintain massive, complex caches: one of the hardest problems in computer science in general.

3. They must be highly concurrent, both for performance on multicore architectures and to service multiple clients.

4. Query optimization is a dark art involving multiple algorithms and frankly scary heuristics; the supporting data structures can be quite complex in their own regard.

5. They need to send results over the network, an even higher latency and less reliable system.

6. They must present some order of events: whether ACID-transactional, partially ordered with vclocks, etc.

Almost all these problems interact with each other: a logically optimal query plan can end up fighting the disk or network; concurrency makes data structures harder, networks destroy assumptions about event ordering, caches can break the operating system's assumptions, etc.

Moreover, the logical role of a database puts it in a position to destroy other services: when it fails, shit hits the fan, everywhere. Recovery is hard: you have to replay logs, spool up caches, hand off data, resolve conflicts. Yes, even in mostly-CP systems like MSSQL. All of these operations are places where queues can overflow, networks can fail, disks can thrash, etc.

Furthermore, the quest for reliability can involve distributing databases over multiple hosts or geographically disparate regions. This distribution causes huge performance and concurrency penalties: we're limited by the speed of light for starters, and furthermore by the CAP theorem. These limits will not be broken barring a revolution in physics; they can only be worked around.

The only realistic approaches (speaking broadly) are CP (most traditional DBs) and AP (under research). CP is well-explored, AP less so. I expect the AP space to evolve considerably over the next ten years.

cbsmith · on May 15, 2012

First, I'm going to quibble with #5. The networks databases typically must communicate over are typically much lower latency and are at the very least far less subject to spontaneous hardware failure (human error is another matter that I'll concede is certainly debatable as that has a lot to do with the complexity of your storage and network systems...).

From elsewhere (lost the link) you wrote:

> My point is that every database comes with terrifying failure modes. There is no magic bullet yet. You have to pick the system which fits your data and application.

I'm totally with you in this part. "State" is the magic from whence most (all?) software bugs seem to stem. This has lead to emergent design time behaviour such that as few of the components as possible manage the authoritative logical state of our systems. It lead to it being desirable for those few remaining stateful components to be as operationally resilient in the face of adversity as conceivable. These forces have conspired to ensure herculean efforts to make such systems resilient and perfect, almost as much as they have conspired to ensure those components present the most impressive catastrophic failure scenarios. ;-)

While on one hand I completely agree that one needs to select the system which best fits your data and application, the maturity of more established solutions ensures they've been shaped by a far more robust set of different contexts. They've crossed off their list far more "fixing corner case X is now our top priority" moments than those who have come since. They really can better address a much broader range of the needs of at least applications that have been developed to date. By comparison most "NoSQL" stores have advantages in a fairly narrow set of contexts. Particularly if you are a startup and really don't know what your needs will be tomorrow, the odds favour going with as high a quality RDBMs as you can get your hands on.

I really do think far too few startups consider the reality that the thought process should probably be, "Are the core technical problems that stem from our long term product goals fundamentally not well suited to solutions available with RDBMS's?", and only asking about alternatives if the answer is no.

Another way to look at is is, from 1-6, only #4 could be argued as not being applicable to a file server (I'd throw in SAN too, but those are so often lumped in with RDBM's, so let's skip that one)? And really, #4 is both simpler and harder for filesystems: they have a far more trivial interface to contend with... but consequently have a far more difficult time determining/anticipating query access patterns for selecting optimal data layout, data structures and algorithms to maximize IO efficiency.

Why don't we worry as much about choosing a filesystem to suit our app as we do a database?

aphyr · on May 15, 2012

First, I'm going to quibble with #5. The networks databases typically must communicate over are typically much lower latency and are at the very least far less subject to spontaneous hardware failure (human error is another matter that I'll concede is certainly debatable as that has a lot to do with the complexity of your storage and network systems...).

I really wish this were true man! For many shops it is, but under load, even local networks will break down in strange ways. All it takes is dropping the right syn packet and your connect latency can shoot up to 3 seconds. It's also common to have a database system with 0.1+ light-second distances between nodes, which places a pretty firm lower bound on synchronization time.

I'm totally with you in this part. "State" is the magic from whence most (all?) software bugs seem to stem.

Absolutely agree; databases are simply the most complicated state monads we use. I actually count filesystems as databases: they're largely opaque key-value stores. Node-networked filesystems like coda, nfs, afs, polyserve, whatever that microsoft FS was... those are exactly databases, just optimized for a different use case and with better access to the kernel. SANS are a little less tricky, at least in terms of synchronization, but subject to similar constraints.

While on one hand I completely agree that one needs to select the system which best fits your data and application, the maturity of more established solutions ensures they've been shaped by a far more robust set of different contexts. They've crossed off their list far more "fixing corner case X is now our top priority" moments than those who have come since. They really can better address a much broader range of the needs of at least applications that have been developed to date. By comparison most "NoSQL" stores have advantages in a fairly narrow set of contexts.

You're totally right; traditional RDBMSs have benefited from exhaustive research and practical refinement. Strangely, most other old database technologies have languished. Object and graph stores seem to have been left behind, except for a few companies with specific requirements (linkedin, twitter, google, FB, amazon, etc). AP systems in general are also poorly understood, even in theory: the ongoing research into CRDTs being a prime example.

So to some extent the relatively new codebase of, say, Riak contributes to its horrible failures (endless handoff stalls, ring corruption, node crashes, etc). Those will improve with time; I've watched Basho address dozens of issues in Riak and it's improved tremendously over the past three years. The fundamental dynamo+vclock model is well understood; no major logical concerns there.

Now... the AP space seems pretty wide open; I would not at all be surprised to see, say, distributed eventually-consistent CRDT-based, partial-schema relational, kv, or graph DBs arise in the next ten years. Lots of exciting research going on. :)

Why don't we worry as much about choosing a filesystem to suit our app as we do a database?

If I had to guess, it's because non-esoteric local filesystems offer pretty straightforward capabilities; no relational algebra, no eventual consistency, etc. If you have access to ZFS it's a no-brainer. If you don't, there are decent heuristics for selecting ext4 vs XFS vs JFS et al, and the options to configure them (along with kernel tweaks). The failure modes for filesystems are also simpler.

That said, I agree that more people should select their FS carefully. :)

lwat · on May 15, 2012

This doesn't really answer any of the questions I asked you but from what you wrote here you have to agree that going with a tried-and-true RDBMS that's been around for decades is a much better prospect than choosing one of the new, unproven NoSQL products.

aphyr · on May 15, 2012

Sorry, was trying to address underlying concerns rather than specific software.

you have to agree that going with a tried-and-true RDBMS that's been around for decades is a much better prospect than choosing one of the new, unproven NoSQL products.

I disagree. Like I said, CP vs AP are two very different strategies for addressing mutable shared state, and there are no AP RDMBSs that I'm aware of. If your system dynamics are fundamentally AP, no number of Oracle Exadata racks will do the job. If your dynamics are fundamentally CP, Riak, SimpleDB, and Cassandra will make your life a living hell. Few applications are all-AP or all-CP, which is why most teams I know use a mixture of databases.