There's no need to burden yourself with making a system scale more than 10x further than it needs to, as long as you're confident that you'll be able to scale it as you grow.
It should be noted that the 10x quote comes from Jeff Dean. Not only does it make sense from the business point of view, it's also nearly impossible to scale a system 100x: you simply have no idea what the traffic patterns and bottlenecks will be at 100x the volume.
In other words, going from 60k uniques to 600k uniques can be possible with the same architecture. Thinking you can go 60k to 6,000,000 on the same architecture is hubris (as is planning for the 6,000,000 mark while you're at 60k).
It's also extremely annoying to see what started as people looking into the right tools for their jobs dissolve into a hype bandwagon (but I guess it happens to any technology and I'll be a happier person if I ignore this :-)). First the hammer for all persistence was MySQL + an ORM (ignoring both other RDBMSs, e.g. Postgres, and other approaches such as BerkeleyDB), then it was CouchDB, and now it seems to be Mongo or Cassandra. These are entirely different systems, and each has a use case where it shines. Why are people all of a sudden demanding that others use a "NoSQL solution"?
That being said, MySQL isn't that hot in terms of reliability (of course, starting off with Oracle is a recipe for technical debt and vendor lock-in). There's no substitute for proper backups and administration. This is more of an argument for MySQL (or my personal favourite, Postgres) than against it: with MySQL the backup and administration solutions are very well known.
Some of the points the article makes about distributed systems are also fairly confusing and incorrect. CouchDB and Mongo are (scalability-wise) essentially the same as MySQL: "potentially consistent" replication (no ability to perform quorum reads, no guarantee of eventual consistency, no version vectors) and not truly distributed (a 1,000-machine cluster consisting of 500 "shards" of two replicas each is really 500 separate clusters). MySQL already has very relaxed semantics: a master-slave replication setup makes absolutely _no_ guarantee that commits made to the master can be consistently read from the slaves, or even that they're persisted on a slave if the master crashes.
Dynamo- and BigTable-inspired systems are radically different. However, as with any other system under active development (which also includes RDBMSs such as VoltDB), unless you're ready to ask a committer on one of these systems to institute a special QA cycle and support for your use case (or you are a committer yourself, in which case you know what you're doing), using one of them as a primary datastore means taking on technical risk. A technical risk that isn't needed when all you need is to persist data for 60k monthly uniques: any system can do that just fine. Depending on the application's usage patterns, you can also handle 60,000,000 monthly uniques while sticking with MySQL (though most other parts of the application, including the data access layer to MySQL, would have to be rewritten many times over).
In the end Adam's decision is probably the correct decision (I don't work for Quora, so I can't tell), but the technical reasons sound slightly ignorant. It's perfectly fine to be ignorant of these topics, it's not, however, okay to speak authoritatively on a topic you're ignorant of (e.g., I know nothing of machine learning, so I am not going to speculate on how Quora should be using machine learning to make their homepage more relevant -- which they probably are -- I'd add nothing to the conversation).
You can actually get pretty far on a single MySQL database and not even have to worry about partitioning at the application level. You can "scale up" to a machine with lots of cores and tons of ram, plus a replica. If you have a layer of memcached servers in front of the databases (which are easy to scale out) then the database basically only has to worry about writes. You can also use S3 or some other distributed hash table to take the largest objects out of rows in the database. There's no need to burden yourself with making a system scale more than 10x further than it needs to, as long as you're confident that you'll be able to scale it as you grow.
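The "memcached in front of the database" layering described above is the classic cache-aside pattern. Here's a minimal sketch of the read and write paths; the `cache` and `db` dicts are stand-ins for a memcached pool and a MySQL table, and all key names are made up for illustration.

```python
cache = {}                           # stand-in for a memcached pool
db = {"user:42": {"name": "Ada"}}    # stand-in for a MySQL table

def get(key):
    value = cache.get(key)
    if value is None:                # cache miss: fall back to the database
        value = db.get(key)
        if value is not None:
            cache[key] = value       # populate the cache for later reads
    return value

def put(key, value):
    db[key] = value                  # writes always go to the database...
    cache.pop(key, None)             # ...and invalidate any stale cache entry

print(get("user:42"))                # miss: served from db, then cached
print(get("user:42"))                # hit: served from cache
```

With this layering, repeated reads never touch the database, which is why the comment can say the database "basically only has to worry about writes."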
I've dealt with some insanely convoluted legacy systems made up of a fragile bunch of badly configured clusters ("for scalability!") whose speed and load capacity I improved 100,000x simply by putting them on a single server with tons of RAM and combing through the slow query log.
I've had my own wins with avoiding premature optimization. My personal data-access code uses a lot of reflection to figure out which stored procedures to call without me having to specify much metadata (following the DRY principle). I started working at a new company and brought the code with me, where the resident "alpha nerd" criticized it: "all that reflection must be slow." I just smiled, did what I wanted to do anyway (I now outrank him), and showed the proof in the pudding: the ~50ms the code spends reflecting on its own stack trace is completely dwarfed by the overhead of connecting to the database, waiting on stored procedures to execute, transferring data over the network, and processing result sets into suitable outputs.
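The stack-trace trick above can be sketched in a few lines: look one frame up the call stack and derive the stored-procedure name from the calling method's name. The `usp_` prefix and the naming convention are assumptions for this sketch, not the commenter's actual code.

```python
import inspect

def current_proc_name():
    # Look one frame up the stack and turn get_user_orders -> usp_GetUserOrders.
    caller = inspect.stack()[1].function
    return "usp_" + "".join(part.title() for part in caller.split("_"))

def get_user_orders():
    proc = current_proc_name()
    # A real version would now call e.g. cursor.callproc(proc, args);
    # here we just return the derived name to show the mapping.
    return proc

print(get_user_orders())  # -> usp_GetUserOrders
```

The one-time reflection cost is tiny next to a network round trip to the database, which is the commenter's point.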
We typically write apps for 10 to 50 users; our complexity is in reporting, not request throughput. There's no good way to get around having to outer join on 10 different legacy tables of non-normalized data (which is that way because of some cheese-head early-optimizer who hasn't worked here in three years) to get the report that the client insists is absolutely needed, even though they can't come up with any sort of business case for it.
Taking daily pageviews and dividing by 24 * 60 * 60 assumes your traffic curve isn't a curve at all - it's flat. For your average US-based consumer web app, it's highly unlikely that you'll have just as many people using your app at 3am as at 3pm.
When scaling a web app, you need to prepare for peak usage, not average usage.
Here are a few first-hand examples. Friends for Sale, a social game, had ~10M daily pageviews (100 req/sec avg) but had peak load of > 200 req/sec. Ideeli, a "flash sale" designer shopping site, typically had peak load 5x-20x higher than median load.
Regardless of what Quora's traffic curve looks like, wheels' point still stands - it doesn't matter what backend you're using for 60k users on a web app.
There's an entire branch of mathematics that covers this, called queueing theory. It's pretty straightforward to find the probability of N requests per second for an M/M/1 system. Then you decide, "I want to be able to handle the 99.9% case" (or whatever), and you know what your peak capacity has to be.
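One concrete version of that calculation: if requests arrive as a Poisson process, the number of arrivals in any one second is Poisson-distributed with mean equal to the average rate, so the "99.9% case" is just a Poisson quantile. The 100 req/sec average below is a made-up example figure.

```python
import math

def poisson_quantile(rate, target=0.999):
    # Walk the Poisson CDF until it covers the target fraction of seconds.
    p = math.exp(-rate)      # P(N = 0)
    cdf = p
    n = 0
    while cdf < target:
        n += 1
        p *= rate / n        # recurrence: P(N = n) = P(N = n-1) * rate / n
        cdf += p
    return n

avg_rate = 100  # req/sec average, for illustration
print(poisson_quantile(avg_rate))  # per-second capacity covering 99.9% of seconds
```

Note the answer comes out well above the average rate even with no daily traffic curve at all; random clumping alone forces you to provision headroom.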
I don't understand all these "why we're not using NoSQL" posts. Why is it interesting that someone chose the default option? Why would they use a NoSQL database of some sort for a major project unless they actually had a need for it? I'm half expecting a "why we don't use Brainfuck"* post one of these days...
*Not that I'm comparing any NoSQL database or NoSQL databases in general to Brainfuck, it's just that I don't see the point of "why we don't use X" posts when there was no particular reason to suppose that they might use X in the first place.
I agree, and I think it's a side effect of the polarization that's happened in the database world. Picking the term 'NoSQL' was maybe a bit aggressive, as it grouped a bunch of new ideas and projects under one banner as well as defined a clear 'opponent' to touch off the new database holy war.
Like most things, choosing a database on an ideological basis is probably not a good idea. As my grandpappy used to say "The right tool for the right job" (he didn't really say that, I'm just making things up).
> 2. These distributed databases like Cassandra, MongoDB, and CouchDB aren't actually very scalable or stable. Twitter apparently has been trying to move from MySQL to Cassandra for over a year. When someone reports using one of these systems as their primary data store for over 1000 machines for over a year, I'll reconsider my opinion on this.
Are there facts to back up what he says about MongoDB, for example, or is this just FUD?
His assertion is actually backed up by MongoDB docs: "Sharding will be production-ready in MongoDB v1.6, estimated to be released in July 2010." Their current sharding alpha is only tested/supported for small clusters (up to 20 nodes). In addition, looking at their sharding mode of operations, only a very limited set of actions are properly sharded; the rest require touching all the shards.
Disclaimer: this was my first look at MongoDB. However, based on this alone, it would be prudent to wait several years before considering this technology for serious production use. It's somewhat reckless to call a DBMS feature "production-ready" when it hasn't been stable for a while.
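The "only a limited set of actions are properly sharded" observation is easy to see in a toy model: a lookup that includes the shard key touches exactly one shard, while anything else is a scatter-gather across all of them. Shard count, keys, and data below are all made up for illustration.

```python
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key):
    return hash(key) % NUM_SHARDS

def put(key, doc):
    shards[shard_for(key)][key] = doc

def get_by_key(key):
    # Targeted operation: exactly one shard is consulted.
    return shards[shard_for(key)].get(key)

def find(predicate):
    # Scatter-gather: every shard must be consulted, so adding shards
    # doesn't reduce per-query work for these operations.
    return [doc for s in shards for doc in s.values() if predicate(doc)]

put("a", {"user": "a", "age": 30})
put("b", {"user": "b", "age": 40})
print(get_by_key("a"))                # one shard touched
print(find(lambda d: d["age"] > 35))  # all four shards touched
```

A workload dominated by the second kind of query gains little from sharding, which is the scalability concern being raised.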
I finally made the switch from MySQL to Postgres for my current product. The impetus was licensing: I got burned by MySQL's ambiguous licensing on a prior product. Now that I've gotten over the learning curve, I like Postgres about the same as MySQL. Both are good products, but you can move Postgres from a SaaS to an onsite deployment as needed without any licensing issues.
> That's partly because Mysql always was more user friendly
Examples? I started with Postgres and ignored MySQL from day one because of MySQL's deficiencies, and have had 0 problems learning what I needed in the years I've used it. Not to mention I can read a general SQL reference, or the standard, and the information actually applies. Unlike MySQL (quotation marks anyone?).
> That said, overall Postgres is better, with one exception ... Mysql supports multiple storage engines for various needs. In Postgres you don't have that flexibility.
"various needs": what, for those times when you don't care whether your data gets silently corrupted?
> "various needs": what, for those times when you don't care whether your data gets silently corrupted?
I've never had data corruption in MySQL. Of course, I always work with InnoDB; maybe MyISAM is weaker.
And if MySQL were such a weak RDBMS, maybe it wouldn't be used by companies like Google or Amazon (especially for Amazon RDS, which offers preconfigured MySQL instances to clients in Amazon's cloud).
> "various needs": what, for those times when you don't care whether your data gets silently corrupted?
Ebay is using the memory tables engine in Mysql instead of Memcached. According to their benchmarks, the Mysql client is more stable, and the combination is more scalable.
There's also a whole ecosystem of third party storage engines. Like when Adobe developed an in-house distributed storage ... they also had an (internal project) storage engine for Mysql that interacted with it.
PostgreSQL has better performance on real-world, high-concurrency loads over multiple cores. MyISAM is slow on tables with lots of writers (table level locking). InnoDB is better, but PostgreSQL is even better. Supposedly MySQL 5.4 is going to fix this. Move into big queries with lots of joins, subqueries, using multiple indexes, and PostgreSQL starts to shine even more.
You can even turn on async commit (since 8.3), which sends a commit acknowledgement back to the client before the transaction is fully committed, allowing synchronous clients to move forward. This does open a very small window: if the power were pulled on the machine, you'd lose just the bit of data from the tenth of a second that hadn't yet been written to disk. All the other data is safe, though, unlike MyISAM's default fsync=off, which gives it all of its "performance." PostgreSQL also has async network I/O support baked into the libpq driver, which, if the client is written properly, makes this unnecessary. Note that async commit can be set per-connection and toggled on and off.
Virtually all of the 7.x releases were focused on durability, and 8.x releases were focused on performance, culminating with 9.0, due out soon, which will bring the much-needed readable slaves. This is the final feature that will tip the scalability gap in PostgreSQL's favor.
There is one situation where MySQL shines: read-only MyISAM tables with simple queries where joins aren't used. Virtually all of the write-concurrency and data-integrity features, as well as PostgreSQL's performance with multiple indexes and joins, become unimportant at that point. MyISAM also supports compressed read-only tables, which are great for this situation. I'd have to ask myself why I wasn't using a simpler NoSQL solution at that point, though.
For any data you want to reliably read back, MySQL really has one (standard) engine†. Nearly every MyISAM DB I've used in a high-traffic environment has eventually become corrupted. Your choice is really either InnoDB or 'repair table'.
† NDB may be a perfectly reliable option, I have not used it.
Built-in MySQL schemas have a problem: schema changes (ALTER TABLE) can take hours or days on large tables. As a result, you need to build a system that doesn't require schema changes at the MySQL level. This doesn't mean you can't use schemas at a higher level, though.
There are two ways to build on top of MySQL - you can either make a flexible schemaless structure with objects and properties and associations between them, which is what Facebook advocates for, or you can make a flexible virtual schema that has column names, types, and arbitrary indexes. I'm in favor of the latter.
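The second approach, a flexible virtual schema on top of MySQL, is roughly: entities live as opaque JSON blobs in one table whose shape never changes, and each queryable attribute gets its own small index table maintained by application code. Here's a minimal sketch; sqlite3 stands in for MySQL, and the table and column names are made up.

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT)")
db.execute("CREATE TABLE index_user (user TEXT, entity_id TEXT)")

def save(entity):
    eid = entity.setdefault("id", uuid.uuid4().hex)
    # The entity table never needs ALTER TABLE: new attributes just
    # appear inside the JSON blob.
    db.execute("REPLACE INTO entities VALUES (?, ?)", (eid, json.dumps(entity)))
    # Index tables are maintained by hand in application code.
    db.execute("DELETE FROM index_user WHERE entity_id = ?", (eid,))
    if "user" in entity:
        db.execute("INSERT INTO index_user VALUES (?, ?)", (entity["user"], eid))
    return eid

def find_by_user(user):
    rows = db.execute(
        "SELECT e.body FROM index_user i JOIN entities e ON e.id = i.entity_id "
        "WHERE i.user = ?", (user,))
    return [json.loads(body) for (body,) in rows]

save({"user": "bret", "text": "hello"})
print(find_by_user("bret"))
```

Adding a new queryable attribute means creating a fresh index table and backfilling it, a rolling operation that never locks the big entities table the way an ALTER TABLE would.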
Ah, OK. I believe Redis persists its in-memory data automatically when it can, though I haven't actually used it. With a MySQL memory table, if you want to write to the memory table AND have it persist, you have to keep track of what has changed yourself and handle writing those changes to disk.
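The bookkeeping that comment describes amounts to dirty tracking plus a periodic flush. A minimal sketch, where a dict stands in for the memory table and an append-only file stands in for the on-disk store:

```python
import json
import os
import tempfile

table = {}     # stand-in for the in-memory table
dirty = set()  # keys changed since the last flush

def write(key, value):
    table[key] = value
    dirty.add(key)             # remember what still needs persisting

def flush(path):
    # Persist only the changed rows, then clear the dirty set.
    changed = {k: table[k] for k in sorted(dirty)}
    with open(path, "a") as f:
        f.write(json.dumps(changed) + "\n")
    dirty.clear()
    return changed

write("a", 1)
write("b", 2)
path = os.path.join(tempfile.mkdtemp(), "journal.log")
flushed = flush(path)
print(sorted(flushed))   # the keys that were persisted
print(dirty)             # empty: nothing left to persist
```

This is essentially what a store with built-in persistence does for you behind the scenes; with a raw MySQL memory table you'd own all of it, including crash recovery.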