Postgres does have (finally!) a nice replication story, so you have data protection on a second server, but the failover mechanics are much more complicated and you still have a single point of failure at the master/writer node. The story gets even more operationally complex when you talk about sharding and the fact that you now have to have a replica per master -- and it really needs to be the same size as the master if you want to be able to trust that you have enough capacity to fail over to it. Suddenly you need to have half of your database capacity sitting idle.
Now, don't get me wrong, I think Postgres is a wonderful database server. For the vast majority of applications it is the correct default choice. Very few applications ever get to the point where they need to scale further than they can get vertically on a single node and most can tolerate a few minutes of downtime to fail over to a backup server. Hand-wavy "sharding is easy" comments, however, ignore a lot of operational reality and that's dangerous.
Understand your use case. Understand the failure modes. Decide how available your datastore needs to be and how much data you need to support. Know the consequences of the choices you make.
So, even if sharding is not easy, you can do quite a lot of it in the time you save by not having to code around the limitations of most NoSQL DBMSs.
The day will come when RDBMSs are the secret productivity weapon of smaller companies that don't feel they have to act like they're Google.
That said, there are very good reasons not to use an RDBMS in cases where the data model or specific access patterns just don't fit. But in most of those cases, I find that using in-memory data structures combined with file system storage, or something like BerkeleyDB, is a much better fit than any server-based DBMS.
Do remember though that, as discussed here recently, most RDBMSs do not act in a fully ACID-compliant way by default. IIRC, providing a complete isolation guarantee often comes with a hefty performance hit, so compromises are made in this area unless you explicitly tell the database to be as careful as it can. I imagine this can cause quite a nightmare for master<->master replication.
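As a concrete example (just a sketch against PostgreSQL, whose default isolation level is READ COMMITTED, using a made-up accounts table), you have to opt in to the strict guarantee per transaction:

    -- PostgreSQL runs at READ COMMITTED unless told otherwise; ask for full
    -- serializability explicitly.
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;

    COMMIT;  -- may fail with a serialization error under load and need a retry

And once you do opt in, you have to be prepared to retry transactions that fail with serialization errors, which is exactly the performance/complexity cost the looser defaults are trading away.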
There are a lot of people using "noSQL" options for the wrong reasons (such as to be buzzword compliant, or because they don't understand SQL), but there are issues that traditional RDBMSs have that stick-in-the-muds like me (who cringe at the phrase "eventual consistency") should be more aware of than we generally are.
Making the right choices about your data storage can be hard.
I want to go into this just a little. The issue here is just that there are tradeoffs. Understanding these tradeoffs is what good design is about.
RDBMSs excel at one thing: presenting stored data for multiple purposes. This is what relational algebra gets you, and while SQL is not relational algebra, it is an attempt to bridge relational algebra with a programming language.
An RDBMS will never be as fast as a basic object storage system. However, what it buys you is flexibility for the part of the data that needs to be presented for multiple uses.
So the sort of access pattern that doesn't fit is something like an LDAP server. Here you have a well-defined method for integrating the software, and the use cases for ad hoc reporting usually aren't there.
On the other hand, you have something like an ERP that allows file attachments to ERP objects. Even though the data model doesn't fit exactly, you can't do this gracefully with a NoSQL solution.
So I suggest people think about the need for flexibility in reporting. The more flexibility required, the more important an RDBMS becomes.
Additionally if you have multiple applications hitting the same database, an RDBMS is pretty hard to replace with any other possible solution.
Will this eventually be a problem for NoSQL? Or, is the scalability worth the sacrifice?
Or, does NoSQL typically have only one (main) app, so making it work with a specific storage representation is not a big deal? The relational use-case was many different apps, with different versions, needing different access paths to the data. But if you just have one known set of access paths (like a REST URI), you can just design the DB for that. Hierarchical databases worked well when just one DB, one app; they just weren't very flexible.
Hierarchical databases are fast and simple but inflexible as the relationship is restricted to one-to-many, only allowing for one parent segment per child. http://it.toolbox.com/wiki/index.php/Hierarchical_Database
Suddenly I want to spend my next Saturday night setting up Cassandra at a party...
Edit: Come on, "dropping acid"! I thought it was funny...
228MB of data is nothing, it fits in RAM of any machine. What would the examples in this blog post look like if it was 228GB of data spread across 10 servers instead? How would you grow/shrink such a cluster? How would you perform a query that cuts across servers and aggregates data from some of them? How would you replicate data across multiple servers?
You bring up an interesting point. At what data size do the companies using the various NoSQL DBs feel they have to move beyond a traditional RDBMS? I work with some traditional RDBMS stores now that are >500GB in size with tables that add 20M-30M rows/month and querying still isn't an issue. Admittedly it takes expertise to optimize the system and make it work efficiently, but that's going to be the case with any datastore.
Not that I particularly like Oracle as a programmer. I have to check my calendar every time I hit the 32-character limit for names or once again have to write
constraint "y_or_n" check some_field_expressing_truth in ('Y','N');
It's been amazing watching the performance surge in postgres these past few years. I wonder if Red Hat or similar will try sponsoring a tilt at the TPC-C crown in future.
I don't care about what sells NoSQL products to enterprise users - there are lots of workloads where the data fits in RAM (for some reasonable quantity of RAM that money can buy) and you still prefer the durability and consistency that comes with standard SQL databases, even if you denormalize and/or use schemaless (XML or JSON) storage.
Your point is valid of course. I just wanted to point it out for people that think each action in a RDBMS results in a costly IO operation.
You make it sound as if these products scale "magically". This is definitely not the case. There's quite a bit of massaging needed to make riak, hbase et al scale beyond certain thresholds - and you better pay close attention to their respective peculiarities.
> How would you perform a query that cuts across servers and aggregates data from some of them?
That's actually not very hard to implement once you understand how it needs to be done. And you do need this understanding with any k/v-store, otherwise you'll be very sad when the magic stops working (as documented time after time for MongoDB).
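The pattern is just scatter-gather: run the same partial aggregate on every shard, then merge the partials at the coordinator. As a rough illustration (staying in Postgres terms, with made-up shard hostnames and a made-up tweets table, and using the dblink extension rather than any particular k/v-store's API):

    -- Hypothetical manual scatter-gather across two shards with dblink:
    -- each remote query returns a partial aggregate, combined locally.
    CREATE EXTENSION IF NOT EXISTS dblink;

    SELECT user_name, sum(retweets) AS retweets
    FROM (
        SELECT * FROM dblink('host=shard1.internal dbname=app',
            'SELECT user_name, sum(retweets) FROM tweets GROUP BY user_name')
            AS s1(user_name text, retweets bigint)
        UNION ALL
        SELECT * FROM dblink('host=shard2.internal dbname=app',
            'SELECT user_name, sum(retweets) FROM tweets GROUP BY user_name')
            AS s2(user_name text, retweets bigint)
    ) AS partials
    GROUP BY user_name
    ORDER BY retweets DESC;

None of that is magic either: you still decide how the data is partitioned, what happens while a shard is down, and how the partials get merged, which is rather the point.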
Starting out with a known-good, mature store such as PostgreSQL can make a lot of sense when the trade-offs match your usage pattern.
That's just not true. Some NoSQL products (HDFS, Cassandra etc) sell on the ability to easily scale out. Other (CouchDB, MongoDB etc) focus on other features. CouchDB (for example) doesn't have a real scale-out story at all (beyond "manually shard your data", or try http://tilgovi.github.com/couchdb-lounge/), but that isn't really a problem because it has other features that sell themselves.
I am in no way suggesting that 228MB is enough data! If anything, I'd fault MongoDB for being RAM hungry and storing the same 115k tweets in gigabytes of memory (256MB is just the starter of a 100-course meal for MongoDB). And Facebook ultimately prefers its data to end up in MySQL, running one of the largest deployments at 600 shards, for a reason!
Um, what about Riak, Cassandra, Voldemort, and Hbase? (I'm sure there's a bunch more I'm forgetting)
Again, you are missing the point about maturity and a proven user base, and you're comparing apples with bananas! Try putting the same joins and relations into the NoSQL stores you are bragging about and see how quickly they stop scaling! Want an example? Neo4J!
I thought CouchDB required this as well?
Of course if you've been using SQL for years then this probably sounds difficult in comparison. Except what if you're a JS guy with zero SQL experience?
"Ok, it's a database. How do I query it?"
"You learn this completely new language and dynamically compile your questions down to it, but you have to be really careful because the process is notorious for being a first-class attack vector."
"Did you just tell me to go fuck myself?"
I'm not trying to say anything about the merits of SQL. I'm just pointing out that it's a matter of perspective.
The thing about SQL as an attack vector is frustrating because it (usually) doesn't need to be: use prepared statements and let the driver handle value substitution for you. It's quicker and easier than escaping everything.
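Postgres even exposes the idea directly in SQL, so you can see it without a driver in the way (a sketch with a made-up users table; in practice your driver's parameterized-query API does the same thing for you):

    -- Parameters travel separately from the SQL text, so user input is never
    -- spliced into the statement and there's nothing to escape by hand.
    PREPARE find_user (text) AS
        SELECT id, name FROM users WHERE email = $1;

    EXECUTE find_user('alice@example.com');
    EXECUTE find_user('''; DROP TABLE users; --');  -- just a weird value, not an injection

    DEALLOCATE find_user;

Most drivers wrap exactly this for you: $1 placeholders in node-postgres, ? placeholders in JDBC's PreparedStatement, and so on.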
I've never found the same level of finish, performance, etc. It's also very customizable and very dependable. You can get extremely good performance out of it.
Finally... the devs are plain AWESOME. Every time I had an issue or question, not even necessarily related to PGSQL but to SQL and performance in general, they would reply quickly and accurately (I owe one of them many, many beers).
To give you an example, a company I worked for was doing a 10M to 10M comparison (for very specific needs). All the businesses we found were using custom NoSQL solutions, or even Oracle, with huge clusters, costing a lot, and taking ~24h for a result.
So we implemented ours. After a year of testing, coding, etc., we settled on PGSQL (on FBSD). With the help of the aforementioned dev, we were doing the same comparison in 3h, on a single quad-core system (with heaps of RAM).
Sure, maybe the other businesses never cared that much about performance, but it's still astonishing. Most of the optimizations, made thanks to the PGSQL devs' help, would speed the process up by a thousand times.
Disclaimer: I can't detail the comparison (it's still under NDA), but it included very specific stuff, in case you wonder about the time taken, etc.
A few years ago I got a dreaded call from a customer. A point of sale system was suddenly taking 45 seconds to post invoices. This is a big deal for a cash register role, as you can imagine.
So I went on site, ran some diagnostics, isolated the problem query, etc. At this point they were into the slow part of the afternoon so I asked about it on the -perform list along with query plans and everything else I could think of.
Within an hour I had received an email from Tom Lane explaining:
1) What the planner was getting wrong
2) Why the planner was making an error there, and why this was a special planner case.
3) Suggestions as to how to fix it.
First rate support, let me tell you. (Also, planner behavior was changed in a release soon after to accommodate my use case.)
About 10 years ago I had cause to look at the source and I have to say I don't think I've ever seen better written C. The changes we were considering were vetoed, but I marveled at how easy it was to identify the place I would have needed to make the changes and verify that they would not have unintended consequences.
I don't know if it still does, but the code I saw back then read like a book.
It's not so great if you start using something and one year later see that the original authors lost interest and move on, leaving you with 10k lines of code that depend on a pray-for-no-bugs unsupported piece of software.
> ...gives me 14689 rows just under 360ms on average
Uh, 360ms seems like an awfully long time to query such a small dataset.
Also, I'm pretty sure Heroku uses hstore, because they're the ones that taught me about it.
So not everyone forgot about it.
... which can then be used to power custom indexes of particular collections of documents, like this one:
The fun bit is that, in both Postgres and Solr, you can still run a precise search (including joins for access control, sharing, and boolean full-text clauses) with a single query, because the hstore keys and values are indexed.
The "Docdata" model handles serializing JSON hashes of keys -> string values into the hstore format, and parsing 'em back out. This is a little loose, but works for most of the simple things you'd like for tagging documents (senator: Kerry, statecode: WV, recordid: 019281)
... the resulting effect is that from the Rails side, working with "document.data" directly is like working with a lazy-loaded hash.
The more interesting bit is the search side of things, here's the query parser and query object (which generates SQL and Solr queries):
... for any given query, we detect if Solr is required (if full-text is involved), or if the search can be satisfied purely in Postgres. So there's one way of generating the key:value query for each back end.
Hope that answers the question.
(Edit:) Forgot to mention that we add a little extra on top of vanilla hstore "key:value" queries. You can write "dob: *" to show documents that have a date of birth tagged at all, and "dob: !" to show documents that haven't been tagged with a date of birth yet.
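For anyone who hasn't played with hstore, the Postgres side of those queries looks roughly like this (a sketch with made-up table and index names, not the actual DocumentCloud schema):

    CREATE EXTENSION IF NOT EXISTS hstore;

    -- One hstore column holds the loose key -> value tags;
    -- a GIN index keeps key and key=>value lookups fast.
    CREATE TABLE documents (
        id    serial PRIMARY KEY,
        title text,
        data  hstore
    );
    CREATE INDEX documents_data_idx ON documents USING gin (data);

    INSERT INTO documents (title, data)
    VALUES ('Kerry letter', 'senator => Kerry, statecode => WV, recordid => 019281');

    -- "senator: Kerry"  ->  containment
    SELECT id FROM documents WHERE data @> 'senator => Kerry';

    -- "dob: *"  ->  the key exists, whatever its value
    SELECT id FROM documents WHERE data ? 'dob';

    -- "dob: !"  ->  the key has not been set yet
    SELECT id FROM documents WHERE NOT data ? 'dob';

The GIN index is what keeps the ? and @> lookups cheap as the tag vocabulary grows.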
There are whole Salvatore posts about doing binary data structures in Redis string values.
Strings are the most basic kind of Redis value. Redis Strings are binary safe, this means that a Redis string can contain any kind of data, for instance a JPEG image or a serialized Ruby object.