You can actually get pretty far on a single MySQL database and not even have to worry about partitioning at the application level. You can "scale up" to a machine with lots of cores and tons of ram, plus a replica. If you have a layer of memcached servers in front of the databases (which are easy to scale out) then the database basically only has to worry about writes. You can also use S3 or some other distributed hash table to take the largest objects out of rows in the database. There's no need to burden yourself with making a system scale more than 10x further than it needs to, as long as you're confident that you'll be able to scale it as you grow.
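The memcached-in-front pattern described above is the classic cache-aside read path. Here's a minimal Python sketch of the idea, with a plain dict standing in for a real memcached client and `fetch_user_from_db` as a hypothetical stand-in for a MySQL query (both names are made up for illustration):

```python
# Cache-aside: reads hit the cache first; only misses touch the database,
# so the database mostly just handles writes.
# A dict stands in for a real memcached client here.

cache = {}

def fetch_user_from_db(user_id):
    # Imagine a SELECT against the primary (or a replica) here.
    return {"id": user_id, "name": "user%d" % user_id}

def get_user(user_id):
    key = "user:%d" % user_id
    row = cache.get(key)
    if row is None:                  # cache miss: query the database
        row = fetch_user_from_db(user_id)
        cache[key] = row             # populate so the next read is a cache hit
    return row

def update_user(user_id, fields):
    # Writes go to the database (not modeled here), then invalidate the
    # cached copy so the next read repopulates it.
    cache.pop("user:%d" % user_id, None)
```

Because the cache layer is just a key-value lookup, you can add memcached boxes behind a hash ring without touching the database at all, which is what makes that tier easy to scale out.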
I've dealt with some insanely convoluted legacy systems made up of a fragile bunch of badly configured clusters ("for scalability!") whose speed and load capacity I improved 100000x simply by putting them on a single server with tons of RAM and combing through the slow query log.
We typically write apps for 10 to 50 users; our complexity is in reporting, not request throughput. There's no good way around outer joining 10 different legacy tables of non-normalized data (which is that way because of some cheese-head early-optimizer who hasn't worked here in three years) to get the report the client insists is absolutely needed, despite being unable to come up with any sort of business case for it.
it was outsourced to a vendor that pretty much took advantage of the contracting startup's lack of DB knowledge. so they got the kitchen sink and a fat invoice.
of course it's easy for the vendor to defend this - "what, we want you to be scalable!".
Assuming 100 requests per user per month, 60k uniques is 2 requests per second. They could run the whole site on one Linode unless they've got a particularly bad architecture.
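The arithmetic behind that estimate, spelled out (the 60k uniques and 100 requests/user/month figures are the comment's own assumptions):

```python
# Back-of-the-envelope: monthly request volume spread over a month's seconds.
users = 60_000
requests_per_user_per_month = 100
seconds_per_month = 30 * 24 * 3600   # ~2.6M seconds

avg_rps = users * requests_per_user_per_month / seconds_per_month
print(round(avg_rps, 1))             # ~2.3 average requests per second
```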
Your point is valid, but it doesn't work like that.
When scaling a web app, you need to prepare for peak usage, not average usage.
Here are a few first-hand examples. Friends for Sale, a social game, had ~10M daily pageviews (100 req/sec avg) but had peak load of > 200 req/sec. Ideeli, a "flash sale" designer shopping site, typically had peak load 5x-20x higher than median load.
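Turning those peak-to-average ratios into a provisioning rule of thumb (the ~2.3 req/s average and the 5x-20x multipliers come from the comments above; the headroom factor is an assumption for illustration):

```python
def capacity_needed(avg_rps, peak_factor, headroom=1.5):
    """Provision for peak load plus some headroom, not for the average."""
    return avg_rps * peak_factor * headroom

# At a 20x peak factor, a 2.3 req/s average site needs ~70 req/s of capacity.
for factor in (5, 10, 20):
    print(factor, round(capacity_needed(2.3, factor), 1))
```

The point is that the peak factor, not the average, dominates the sizing: two sites with identical monthly traffic can need an order of magnitude different capacity.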
Regardless of what Quora's traffic curve looks like, wheels' point still stands - it doesn't matter what backend you're using for 60k users on a web app.
I know what you're trying to say, but at Quora's numbers, as OP said, it truly is irrelevant.
I've employed many of the tactics he describes and have achieved >15 M/s of dynamic content through MySQL. Not Facebook, but it was fast and scalable for the market my app was geared towards.
*Not that I'm comparing any NoSQL database or NoSQL databases in general to Brainfuck, it's just that I don't see the point of "why we don't use X" posts when there was no particular reason to suppose that they might use X in the first place.
Like most things, choosing a database on an ideological basis is probably not a good idea. As my grandpappy used to say "The right tool for the right job" (he didn't really say that, I'm just making things up).
Are there facts to back up what he says about MongoDB, for example, or is this just FUD?
Disclaimer: this was the first look I've had at MongoDB. However, based on this alone, it would be prudent to wait several years before considering using this technology in serious production. It is somewhat reckless to consider a DBMS whose feature set hasn't been locked down for a while "production-ready".
It's not the most sound technical reasoning.
It took me 48 hours to convert to it; 10 months later, I still don't feel like I know much about it.
That's partly because MySQL always was more user-friendly, and also because I attended a MySQL conference and got to know some of the people working on MySQL.
That said, overall Postgres is better, with one exception ... Mysql supports multiple storage engines for various needs. In Postgres you don't have that flexibility.
Examples? I started with Postgres and ignored MySQL from day one because of MySQL's deficiencies, and have had 0 problems learning what I needed in the years I've used it. Not to mention I can read a general SQL reference, or the standard, and the information actually applies. Unlike MySQL (quotation marks anyone?).
> That said, overall Postgres is better, with one exception ... Mysql supports multiple storage engines for various needs. In Postgres you don't have that flexibility.
"various needs": what, for those times when you don't care whether your data gets silently corrupted?
I've never had data corruption in MySQL. Of course, I'm always working with InnoDB; maybe MyISAM is weaker.
And if MySQL were such a weak RDBMS, maybe it wouldn't be used by companies like Google or Amazon (especially for Amazon RDS, which offers preconfigured MySQL instances in Amazon's cloud).
> "various needs": what, for those times when you don't care whether your data gets silently corrupted?
eBay is using the MEMORY storage engine in MySQL instead of memcached. According to their benchmarks, the MySQL client is more stable, and the combination is more scalable.
There's also a whole ecosystem of third-party storage engines. When Adobe developed an in-house distributed store, for example, they also built an internal storage engine for MySQL that interacted with it.
Also, the soon-to-be replacement for InnoDB is Falcon ... http://forge.mysql.com/wiki/Falcon (optimized for multiple cores).
Or RethinkDB (YC startup) ... a storage engine optimized for solid-state drives ... http://rethinkdb.com/
And I don't really care about what others think ... Mysql served me well.
You can even turn on async commit (since 8.3), which sends a commit acknowledgement back to the client before the transaction is fully committed to disk, allowing synchronous clients to move forward. This opens a very small window where, if the power were pulled on the machine, you'd lose just the transactions from the tenth of a second that hadn't yet been written to disk. All the other data stays safe, though, unlike with MyISAM's default fsync=off, which is where all of its "performance" comes from. PostgreSQL also has async network I/O support baked into the libpq driver, which, if the client is written properly, makes this unnecessary. Note that async commit can be set per-connection and toggled on and off.
Virtually all of the 7.x releases were focused on durability, and 8.x releases were focused on performance, culminating with 9.0, due out soon, which will bring the much-needed readable slaves. This is the final feature that will tip the scalability gap in PostgreSQL's favor.
There is one situation where MySQL shines: read-only MyISAM tables with simple queries where joins are not used. Virtually all of the write-concurrency and data-integrity features, as well as PostgreSQL's performance using multiple indexes and joins, become unimportant at that point. MyISAM also supports compressed read-only tables, which are great for this situation. I'd have to ask myself why I wasn't using a simpler NoSQL solution at that point, though.
I've thought about switching to Postgres, but it doesn't currently appear to be worth the pain in my case.
† NDB may be a perfectly reliable option, I have not used it.
It should be noted that the 10x quote comes from Jeff Dean. Not only does it make sense from the business point of view, it's also nearly impossible to scale a system 100x: you simply have no idea what the traffic patterns and bottlenecks will be at 100x the volume.
In other words, going from 60k uniques to 600k uniques can be possible with the same architecture. Thinking you can go 60k to 6,000,000 on the same architecture is hubris (as is planning for the 6,000,000 mark while you're at 60k).
It's also extremely annoying to see what started as people looking into the right tools for their jobs dissolve into a hype bandwagon (but I guess it happens to any technology, and I'll be a happier person if I ignore this :-)). First the hammer for all persistence was MySQL + an ORM (ignoring both other RDBMSs, e.g. Postgres, and approaches such as BerkeleyDB), then it was CouchDB, and now it seems to be Mongo or Cassandra. These are entirely different systems, and each has a use case where it shines. Why are people all of a sudden demanding that others use a "NoSQL solution"?
That being said, MySQL isn't that hot in terms of reliability (and of course starting off with Oracle is a recipe for some technical debt and vendor lock-in). There's no substitute for proper backups and administration. This is more of an argument for MySQL (or my personal favourite, Postgres) than against it: with MySQL the backup and administration solutions are very well known.
Some of the points the article makes about distributed systems are also fairly confusing and incorrect. CouchDB and Mongo are (scalability-wise) essentially the same as MySQL: "potentially consistent" replication (no ability to perform quorum reads, no guarantee of eventual consistency, no version vectors) and not truly distributed (a 1000-node cluster consisting of 500 "shards" of two replicas each is really 500 separate clusters). MySQL already has very relaxed semantics (a master-slave replication system makes absolutely _no_ guarantees about commits made to the master being consistently read from the slaves, or even being persisted on a slave if the master crashes).
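For contrast, the quorum reads mentioned above work like this: with N replicas, writes acknowledged by W of them, and reads consulting R of them, R + W > N guarantees every read quorum overlaps every write quorum, so a read always sees at least one replica with the latest acknowledged write. A minimal illustration (the replica values and the `(version, value)` representation are made up for this sketch, not any particular system's API):

```python
def quorums_overlap(n, r, w):
    """R + W > N guarantees a read quorum intersects every write quorum."""
    return r + w > n

def quorum_read(replica_values, r):
    """Consult r replicas and take the highest-versioned value.

    replica_values is a list of (version, value) pairs, one per replica.
    For simplicity this consults the first r replicas; the overlap
    guarantee is what makes ANY choice of r replicas include a fresh copy.
    """
    consulted = replica_values[:r]
    return max(consulted, key=lambda v: v[0])

# 3 replicas; a write with W=2 reached the first two, the third is stale.
replicas = [(2, "new"), (2, "new"), (1, "old")]
assert quorums_overlap(n=3, r=2, w=2)
print(quorum_read(replicas, r=2))   # (2, 'new')
```

Master-slave replication offers nothing equivalent: a read routed to a stale slave simply returns stale (or, after a master crash, lost) data, which is the "very relaxed semantics" point above.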
Dynamo- and BigTable-inspired systems are radically different. However, as with any other system under active development (which also includes RDBMSs such as VoltDB), unless you're ready to ask a committer on one of these systems to institute a special QA cycle and support for your use case (or you are a committer yourself, in which case you know what you're doing), using one of these systems as a primary datastore means taking on technical risk. A technical risk that's not needed when you need to persist data for 60k monthly uniques: I can do that just fine with any system. Depending on the application's usage patterns, you can also do it for 60,000,000 monthly uniques while sticking with MySQL (though most other parts of the application, including the data access layer to MySQL, would have to be rewritten many times over).
In the end, Adam's decision is probably the correct one (I don't work for Quora, so I can't tell), but the technical reasons sound slightly ignorant. It's perfectly fine to be ignorant of these topics; it's not, however, okay to speak authoritatively on a topic you're ignorant of (e.g., I know nothing of machine learning, so I am not going to speculate on how Quora should be using machine learning to make their homepage more relevant -- which they probably are -- I'd add nothing to the conversation).
So, schemaless vs. schemaful is similar to static typed vs. dynamic typed? (or is it more like typed vs. untyped?)
> I think the "NoSQL" fad will end when someone finally implements a distributed relational database with relaxed semantics.
What does he mean by "relaxed semantics"?
There are two ways to build on top of MySQL - you can either make a flexible schemaless structure with objects and properties and associations between them, which is what Facebook advocates for, or you can make a flexible virtual schema that has column names, types, and arbitrary indexes. I'm in favor of the latter.
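The first approach (objects, properties, associations) is essentially an entity-attribute-value layout on top of fixed MySQL tables. A rough Python sketch of how a flexible object decomposes into EAV rows and back (the row shapes and names here are made up for illustration, not anyone's actual schema):

```python
# Entity-attribute-value: every object becomes one row per property in a
# fixed (entity_id, attribute, value) table, so adding a new field never
# requires an ALTER TABLE.

def to_eav_rows(entity_id, obj):
    """Flatten one object into (entity_id, attribute, value) rows."""
    return [(entity_id, attr, value) for attr, value in obj.items()]

def from_eav_rows(rows):
    """Reassemble objects from their EAV rows."""
    objects = {}
    for entity_id, attr, value in rows:
        objects.setdefault(entity_id, {})[attr] = value
    return objects

rows = to_eav_rows(42, {"type": "question", "title": "Why MySQL?"})
print(from_eav_rows(rows))  # {42: {'type': 'question', 'title': 'Why MySQL?'}}
```

The virtual-schema alternative keeps the same generic storage but adds a catalog of column names, types, and index definitions on top, so queries can still be planned and validated against something schema-like; that metadata layer is what the EAV approach gives up.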
(I was counting that under "you have to do it yourself" and thought it might do something more automatic but still infrequent.)