I wrote the first version of our data store modeled off of FriendFeed's architecture (back when we were five people). There is an excellent blog post about it here:
When Cassandra came along, we had long debates internally about the risk of hitching ourselves to their wagon. Eventually we decided that an open source project with a similar (enough) architecture to what we'd been building ourselves was preferable to maintaining a Python library in-house.
So far we have no regrets. We have had no production issues and we have an architecture that is built to handle a large amount of incoming data (every note and highlight you take in an Inkling book is synched to our servers). We had already stripped away most SQL functionality by using FF's architecture, so we didn't lose much. And we gained a lot - the ability to tune the durability of different writes depending on the use case, far better fault tolerance, a conceptually simpler indexing strategy, etc.
It definitely has a learning curve - error and status reporting is not wonderful, you generally need to be over-provisioned as a strategy, etc. But I'm happy we learned those lessons sooner rather than later - I'd hate to scramble to implement Cassandra in a strained production environment (my sympathies to a certain website who suffered through this).
I use MongoDB for greenfield app development and experiments, so I can quickly alter "tables" and "attributes" without first having to think about a schema or write migrations.
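In practice that looks something like this (a minimal pymongo sketch; the database, collection, and field names are made up):

    from pymongo import MongoClient

    db = MongoClient()["prototype"]

    # no CREATE TABLE, no migration: just write the document you have today
    db.things.insert_one({"name": "foo", "tags": ["a", "b"]})

    # tomorrow's documents can carry a new "attribute"; old ones are untouched
    db.things.insert_one({"name": "bar", "tags": [], "owner_id": 42})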
I also use it for some smaller tasks like storing filenames/hashes for distributing content across my 5 servers (doesn't really need to be mongo for this), and some configuration information for games using me.
I do it via MongoHQ. I did have it running and replicated across two of my servers, but I'm just not familiar enough with it to keep it running, and that bit me in the ass early on. With MongoHQ all I have to worry about is making sure my indexes are right.
Next up I'll be using it for some non-aggregated analytics stuff, which won't be as big as Facebook's requirements but will still be in the billions-a-month range.
For our project we went the Windows Azure route and store our 'events' in their NoSQL Table Storage. I must admit I have been tempted to switch to MongoHQ due to the lack of query support in Table Storage. Funnily enough, we're also working on a custom leaderboards feature for our customers, and the lack of secondary indexes really makes it difficult.
The custom data on the leaderboards and levels is just ridiculously easy; it's basically just copying object properties/values on the user end straight into document properties/values in MongoDB.
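For illustration (all names invented, assuming pymongo), the "copy it straight in" pattern is roughly:

    import time
    from pymongo import MongoClient

    scores = MongoClient()["game"]["leaderboard"]

    def save_score(payload):
        # whatever custom properties/values the client sends become
        # document properties/values unchanged
        doc = dict(payload)
        doc["submitted_at"] = time.time()
        scores.insert_one(doc)

    save_score({"player": "alice", "level": 7, "score": 9001, "combo": True})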
1) To give people some reasonable expectations about how much data they should be putting on the plan given it is a multi-tenant server and they are sharing resources.
2) To make room for an expanded offering, both a larger high-availability shared plan, and a dedicated server plan.
The article cites graph databases as a better fit for graph applications than relational SQL databases. I'm using Redis data structures like sets and sorted sets, which for those access patterns are 20x to 100x faster than a relational database.
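As a rough sketch of what that looks like (redis-py 3+, invented keys), the set and sorted-set operations run inside Redis, so the join-like work never touches the application or a SQL planner:

    import redis

    r = redis.Redis()

    # "who follows both A and B" as a server-side set intersection
    r.sadd("followers:A", "u1", "u2", "u3")
    r.sadd("followers:B", "u2", "u3", "u4")
    print(r.sinter("followers:A", "followers:B"))          # {b'u2', b'u3'}

    # a sorted set keeps a ranking ordered on every write
    r.zincrby("scores", 10, "u2")
    print(r.zrevrange("scores", 0, 9, withscores=True))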
But I still use SQL for most CRUD operations where I care about consistency and volume isn't high enough to require sharding. There's a place for 'NoSQL', but it's not a panacea.
MongoDB is also rapidly becoming my "go-to" data store when making prototypes to explore some new software project space, because it's so amenable to rapid development and change. It's also replacing memcached in some situations where all I needed was a dumb memory cache. It can act like a dumb memory cache, except it has these extra features waiting in the wings that I want, which is a bonus.
I liked Redis in my evaluations and may use it more in the future but it lost out to MongoDB for the LBS project because it didn't fit the requirements as well or get enough little "taste wins" with me.
Workers connect and get new jobs; they also write the results back to Redis. Everything touches Redis, and Redis is just so damn fast that I don't have to care about scaling for a while :)
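Something like the following (redis-py; the key names and the work function are hypothetical):

    import json
    import redis

    r = redis.Redis()

    # producer: push a job onto a list
    r.lpush("jobs", json.dumps({"id": 1, "task": "resize", "path": "img.png"}))

    # worker: block until a job arrives, do it, write the result back to Redis
    def worker(do_work):
        while True:
            _, raw = r.brpop("jobs")
            job = json.loads(raw)
            r.set("result:%d" % job["id"], json.dumps(do_work(job)))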
The low-hanging fruit have been counter caches (lock contention with MySQL counters was a problem way too early), transient data (sessions, IP bans, etc.), and picking random items from sets (other RDBMSes may have better ORDER BY random() behavior, but MySQL sucks).
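Roughly, with redis-py 3+ (key names invented):

    import redis

    r = redis.Redis()

    # counter cache: an atomic increment, no row lock to contend on
    r.incr("posts:123:comment_count")

    # transient data with a TTL (sessions, IP bans)
    r.setex("ban:10.0.0.5", 3600, "spam")

    # pick a random member of a set, which MySQL's ORDER BY RAND() does badly
    r.sadd("featured_items", "a", "b", "c")
    print(r.srandmember("featured_items"))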
In general I feel like 90% of our data is best served by a relational database. It's possible to shoehorn the rest in, but redis primitives allow highly targeted improvements to both performance and elegance.
We do a bunch of calculation with cached data in Redis, and it enables the naïve pattern of grabbing chunks of shit from the data store without having to cope with the horrible SQL generated by the ORM.
It is refreshing to see these approaches.
I'm still a bit wary of using NoSQL, so we also keep document revisions in SQL just in case, using a draft_id column (e.g., if I modified invoice 1021, the latest version would be updated in SQL and the previous version would be pushed into NoSQL).
We need to see if our software is still deemed auditable by accountants since they have very strict guidelines on how a document trail should be stored.
My mind is still thinking in 3NF though. I understand denormalization and avoiding joins are useful from a performance standpoint. However, I am unsure when to include a foreign key, retrieve it, and perform a second query from the application layer, and when to go ahead and duplicate/embed all of the field data. I'm leaning towards just making two simple sequential key lookups, the second on the retrieved foreign key, rather than duplicating fields everywhere and keeping track of massively cascading changes, even though that means two round trips instead of one MySQL join, and I usually think in terms of minimizing round trips to the database server.
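For concreteness, the two options look roughly like this (pymongo, invented collections):

    from pymongo import MongoClient

    db = MongoClient()["shop"]

    # Option 1: store a reference and make a second keyed lookup
    db.customers.insert_one({"_id": 7, "name": "Ada", "email": "ada@example.com"})
    db.orders.insert_one({"_id": 1, "customer_id": 7, "total": 42.0})
    order = db.orders.find_one({"_id": 1})
    customer = db.customers.find_one({"_id": order["customer_id"]})

    # Option 2: embed the fields you actually display; one round trip to read,
    # but edits to a customer must be fanned out to every order embedding them
    db.orders.insert_one({
        "_id": 2,
        "customer": {"name": "Ada", "email": "ada@example.com"},
        "total": 17.0,
    })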
Wondering if anyone has a heuristic for this or suggested reading?
Outside of this, you need to switch the question you ask from "what data do I need to capture?" to "what questions do I need to answer?"
I'm looking for a way to scale hands-free, very cheaply, and without vendor lock-in. It would be nice if I could simply add another machine to the cluster, and not have to generate IDs in the application layer, use a hashing algorithm to select the correct machine, and have everything stop working when a single database goes down. Seems like it should be a solved problem by now. Investigating new tech won't set me back much time, and my MySQL queries aren't going to disappear if I don't like it. Furthermore, looking at large sites such as Flickr that massively scaled MySQL, it seems like they stopped using its relational features anyway.
There are other operational issues with MongoDB. MongoDB can only do a repair if there is twice as much free disk space available as the database uses, and the server must be effectively brought offline to do it. To reclaim unused disk space, you have to do a, you guessed it, compact/repair. Want to do a backup? The accepted way to do this is to have a dedicated slave that can be write-locked for however long it takes to do your backup. They suggest using LVM snapshots to keep this window short, but disk performance on volumes with LVM snapshots is terrible.
I would consider using MongoDB for a setup that is either non-critical, fits completely in memory with bounded growth (which itself sort of begs the question...), or involves mostly write-once data, such as feeds, analytics, and comment systems.
Regardless, I think my initial question regarding when to denormalize data applies to any database including scaled MySQL, but perhaps was a better question for stackoverflow.
Cassandra has online compaction, but still requires up to 2x space for compaction. However, Cassandra does not have to do a full scan of the entire database to do compaction, and almost never actually uses the 2x space. It's also much easier to maintain a Cassandra cluster, because each instance shares equal responsibility, and replication topology is handled for you.
Despite what their fans will say, these are both beta-quality products.
I don't have any ready figures in front of me for disk performance, but MongoDB uses an in-memory layout and memory-mapped files, which make for a far from optimal data layout on disk. As you might imagine, it works well with SSDs, but performance is awful on rotational disks. Foursquare's outage was caused in part by their MongoDB instance beginning to exceed available RAM. My own experiences with larger-than-RAM MongoDB collections mirror this conclusion. Under these circumstances, you're likely to see much worse performance with MongoDB than with MySQL or PostgreSQL, especially with concurrent writes.
As for the SQL pitfalls, and databases in general, don't think for a second that MongoDB has some magic that exempts it from the performance problems RDBMSes have. Start using any of the familiar SQL scaling "no-no" features: multiple indexes, range queries, limits/offsets, etc., and it's going to start exhibiting performance characteristics much like a PostgreSQL or MySQL database would under those circumstances.
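A deep page through a collection is a typical example (pymongo, invented collection): like LIMIT/OFFSET in SQL, the server still has to walk past everything it skips.

    from pymongo import MongoClient

    posts = MongoClient()["blog"]["posts"]

    # page 10,001: MongoDB, like MySQL or Postgres with a big OFFSET,
    # scans and discards the first 100,000 documents on every request
    page = list(posts.find().sort("created_at", -1).skip(100000).limit(10))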
The temptation with MongoDB is to be lured into the idea that a brand new, immature database with convenient, SQL-like features can perform significantly better than its highly-tuned RDBMS brethren. There are some reasons to choose MongoDB over MySQL and PostgreSQL, but performance should not be one of them.
Basically, if you are prototyping and have a few extra hours to spend playing, I would say go for it. Can't hurt to understand the tool, so you can pick it if it makes sense for what you need to do down the road.
Result: The author has to qualify all statements with "only applies to some NoSQL databases".
These days, I use Redis for caching and as a Celery backend, and I love it to bits.
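The wiring is about this small (assuming a recent Celery; hosts and task names are made up):

    from celery import Celery

    # Redis as both the message broker and the result backend
    app = Celery("tasks",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/1")

    @app.task
    def add(x, y):
        return x + y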
Also, because everyone is going to skim the post and say "you were using the 32-bit version":
1. Only for some of the corruptions.
2. IT SHOULDN'T CORRUPT DATA ANYWAY!
Personally, I actually love SQL. You ask me, nothing satisfies like nested right inner joins. But that's not who I am. I have needs. Sometimes, what I need is a schemaless eventually-consistent document-oriented persistent data store, because I am aggregating data from multiple web service APIs whose field names and structures change around like they were samples on a Girl Talk record. CouchDB ain't SlouchDB... I can dance to that.
I can tell that some of my buddies are embracing databases from the ranks of the quote-unquote NoSQL movement because these databases' aesthetics are not at all like SQL. That's the catch with NoSQL -- it's a really pointless thing to talk about because it's not a thing; it is an un-thing, a classification of everything in the contemporary database world that is not SQL. It's the kind of classification that makes the most sense in the emotional context of how people feel about SQL.
Redis is also great as a backing store for the Rails queuing system Resque.
We are still beta testing and have 100K properties and ~1 million images stored. Query time is faster than our current LAMP site. We will shard based on MLS Vendor as we add more.
Similar sites we're working on use CouchDB in the same capacity, because we can push simple analysis back into the database and represent more complex data in a natural form.
When people are paying for a service, an ACID database is a must.
At their core, each of these systems uses ACID databases (or very nearly, if you nitpick about isolation levels in Oracle). Between databases and between companies, they've developed "eventually correct" schemes to synchronize information. The lease patterns that many of these systems rely on require atomic operations at their core and offer stronger guarantees than the basic eventually consistent systems popularized by models like Amazon's Dynamo.
It's not just that the account values have to agree between systems at the end of the day; they have to actually be correct at the end of the day.
I know that consistency models vary between NoSQL systems (and even within a single NoSQL system). There's some great technology out there and plenty of problems to solve. There are certainly plenty of use cases for NoSQL systems within banks and stock exchanges.
But the "banks are eventually consistent" line of reasoning needs to die.
Amazon's Dynamo itself is built on BerkeleyDB, which is ACID compliant. That doesn't mean Dynamo is an ACID system. You have to view the system as a whole, not just the component parts. The information systems I refer to in large banks, stock exchanges, and logistics are often composed of thousands of instances of ACID-compliant databases, but as a whole operate with eventual consistency guarantees. EC is kind of a misnomer for Dynamo anyways, because it's really TUNABLE consistency. Dynamo can operate in a fully consistent mode, but you're going to sacrifice availability. CAP theorem doesn't care if you're a bank or a stock exchange or you have a trillion dollars. It still applies.
That's not to say it's not great for lots of things. Amazon uses this model for all kinds of stuff, but as soon as you go to check out your order, you get kicked back into ACID-ville.
You make a claim that Amazon uses ACID semantics for the checkout process. The Dynamo paper claims that in order to meet business goals, the complete Amazon.com order process must be highly-available and partition tolerant. A system that is ACID compliant must sacrifice availability or partition tolerance, but this isn't the case for Amazon's purchase process. Amazon simply strengthens consistency and durability guarantees in the case of checkouts with a quorum write. It would be exceedingly rare for a partition or disaster to knock out communication with more than one datacenter, so this works very well in practice.
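The quorum arithmetic behind that is simple (a toy illustration, not Amazon's actual parameters): pick W and R so that any read quorum must overlap the latest write quorum.

    # with N replicas, W + R > N guarantees every read quorum intersects
    # the most recent write quorum, so reads see the strengthened write
    N, W, R = 3, 2, 2
    assert W + R > N   # e.g. write lands on {A, B}; any 2-of-3 read hits A or B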
Now atomic updates to single keys can actually be used as the building blocks for transactional functionality (using the lease pattern and/or compensating operations). If you need a transaction here or there, this might be a very workable solution. If you need lots of transactions, then you end up using Dynamo systems in unnatural ways; many of their performance and availability advantages are wasted in this configuration.
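As a sketch of that building block (Redis used here only as a convenient stand-in for any store with atomic single-key writes; key names invented):

    import uuid
    import redis

    r = redis.Redis()

    def acquire_lease(resource, ttl=30):
        """One atomic write either grants a short-lived lease or fails."""
        token = str(uuid.uuid4())
        # SET ... NX EX: succeeds only if nobody else currently holds the lease
        if r.set("lease:" + resource, token, nx=True, ex=ttl):
            return token
        return None

    token = acquire_lease("account:42")
    if token:
        # do the multi-step change, recording enough to compensate on failure;
        # a production version would verify the token before releasing
        r.delete("lease:account:42")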
So yes, you could build a bank given quorum writes, but the point is that it would probably be a poor engineering choice.
Bank accounts are eventually consistent logs, which don't look like Dynamo, but also look nothing like an RDBMS table. It's well known and understood that almost all banking systems are based around mainframe-era batch-process systems that are eventually consistent in nature. It's awfully ironic that the "hello world" used to demonstrate transactions in the RDBMS world is a debit and credit of bank accounts, a situation that is vanishingly rare in the real world.
Also, please address my original rebuttal to your claim that Amazon uses an ACID database for their checkout process.
I sure don't disagree with everything you are saying.
As for adding transactions to get a balance, yes, this is how the basics work and it doesn't look like ACID at all. However, what you're describing isn't eventual-consistency (per Dynamo), it's eventual-completeness. You can't lose a log entry. All of these messages are two-phase committed between systems.
So as I said before, you could build this kind of system using quorum writes on a Dynamo system, but it wouldn't be a solid engineering choice.
Also, yes, BerkeleyDB has multi-key transactions. Dynamo on top of BDB does not expose this functionality or use it (as described in the paper). Publicly available systems that implement Dynamo, like Cassandra, Riak, or SimpleDB, do not have native transactions either.
Fun chatting. I'm off to bed.
1) I totally concede that you could use Dynamo-style consistency to be the system of record for financial exchanges. I just don't think it's a good idea. I'm not sure anyone building these systems does either.
Though if you're doing e-commerce with a payment processing gateway like PayPal, you might be in a different situation. You're no longer anything like a bank at that point (PayPal is).
2) I don't know what Amazon does now, but last time I talked with them (2009) they were huge users of legacy RDBMSs for actual order processing. I know they're not thrilled with this arrangement, but I doubt they think Dynamo is the answer. Perhaps cores of ACID-ness with Dynamo-esque replication between them...
3) You're right that banks use logs for a tremendous amount of things. And a VISA charge looks nothing like a simple debit and credit transaction. However, if you think that banks don't have debit and credit transactions in their applications, you haven't worked on their applications.
You stated previously that most of these real-world, high-scalability, high-availability systems use ACID databases as a backing store, something I echoed in the context of Dynamo's use of BerkeleyDB (and MySQL/InnoDB) as an underlying data store. I agree with this.
You state that a Dynamo-style key-value store wouldn't be that great for a financial institution's accounting system. I agree with this. Dynamo is terrible for "log" data.
This doesn't change the fact that these systems as a whole aren't ACID, and do not require ACID semantics. Financial systems are often based on networks (in the conceptual sense) with varying degrees of trust, and are batch-reconciled logs of transactions. I'm talking about stock and commodity markets, electronic transfer systems, and traditional banking. They may be built from ACID building blocks at the low level, but that does not change the fact that the systems that manage my checking account, credit cards, and stock market transactions are not functionally ACID, or that they even need to be ACID. Dynamo itself is built on ACID databases, but it is not ACID itself.
The balance of an account can be calculated by adding up all credits and debits, which is what makes a financial account radically different from, say, a Facebook profile. Each transaction is essentially immutable. Even if a transaction must be reversed, it's reversed as an additional credit (or a unique reversal transaction), not a deletion of the original transaction. While transactions against the account are almost always immediately available, they aren't ALWAYS, and this is alright, because the bank's customers understand this. Banks still reconcile accounts on a nightly basis, using batch processes. This is when accounts are officially settled and consistency is applied, which is why it's eventual. Financial accounts are somewhat unique in that their date-sorted, log-structured nature makes them quite suited to eventual consistency.
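A toy illustration of that model (plain Python, invented numbers): the balance is derived from an append-only log, and a reversal is a new entry rather than a deletion.

    # append-only transaction log for one account (amounts in cents)
    ledger = [
        {"id": 1, "amount": 10000},               # deposit
        {"id": 2, "amount": -2500},               # debit
        {"id": 3, "amount": 2500, "reverses": 2}, # reversal is a new credit
    ]

    balance = sum(entry["amount"] for entry in ledger)
    print(balance)   # 10000: entries are never mutated or deleted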
As for banks, exchanges, credit cards, packages, etc:
When non-trivial sums of money are changing hands, ACID stores offer tremendous benefits as a building block in a larger system.
Yes, things go wrong. Even given ACID building blocks, building huge banking systems is hard. Forcing bank developers to worry about consistency not just between systems, but also within systems isn't going to make that job easier.
Also, "free web apps" doesn't translate to "worthless user data", although that's a different argument altogether.