Hacker News new | comments | show | ask | jobs | submit login
Why Quora uses MySQL rather than No-SQL (quora.com)
175 points by bokonist 2494 days ago | hide | past | web | 56 comments | favorite

I find this answer incredibly excellent.


You can actually get pretty far on a single MySQL database and not even have to worry about partitioning at the application level. You can "scale up" to a machine with lots of cores and tons of ram, plus a replica. If you have a layer of memcached servers in front of the databases (which are easy to scale out) then the database basically only has to worry about writes. You can also use S3 or some other distributed hash table to take the largest objects out of rows in the database. There's no need to burden yourself with making a system scale more than 10x further than it needs to, as long as you're confident that you'll be able to scale it as you grow.

I've dealt with some insanely convoluted legacy systems that were made up of a fragile bunch of badly configured clusters ("for scalability!") that I improved the speed and load beariness by 100000x simply by putting it on a single server, tons of ram, and combing through the slow query log.

Sounds like pre-mature optimization, a bad idea which hackers need to be constantly reminded of.

I've had my own wins with avoiding premature optimization. My personal data access code snippets use a lot of reflection to figure out what stored procedures to call without me having to specify too much metadata (going with the DRY principle). I started working at a new company and brought the code with me, where the resident "alpha nerd" criticized it, "all that reflection must be slow." I just smiled and did what I wanted to do anyway (I now outrank him) and showed the proof in the pudding: the ~50ms that the code spends in reflecting on its own stack trace is completely dwarfed by the overhead of connecting to the database, waiting on stored procedures to execute, transferring data over the network, and processing result sets into suitable outputs.

We typically write apps for 10 to 50 users; our complexity is in reporting, not request throughput. There's no good way to get around having to outer join on 10 different legacy tables of non-normalized data (which is that way because of some cheese-head early-optimizer who hasn't worked here in three years) to get the report that the client insists is absolutely needed regardless of being unable to come up with some sort of business case.

yes that's exactly what it was.

it was outsourced to a vendor that pretty much took advantage of the contracting startup's lack of DB knowledge. so they got the kitchen sink and a fat invoice.

of course it's easy for the vendor to defend this - "what, we want you to be scalable!".

Flip side: How many startups ask for a solution that will scale to millions of users per day once their app goes live?

It's pre-mature scalability, where noob web developers are fond of, thinking they are in the big league with tons of traffic.

It's a site with in the ballpark of 60k uniques per month. What they use is pretty irrelevant. Really all this reduces to is, "it's dumb to optimize for problems you don't have."

Assuming 100 requests per user per month, 60k uniques is 2 requests per second. They could run the whole site on one Linode unless they've got a particularly bad architecture.

> 2 requests per second

Your point is valid, but it doesn't work like that.

Do you care to elaborate?

Taking daily pageviews and dividing it by 24 * 60 * 60 assumes your traffic curve isn't a curve at all - it's flat. For your average US-based consumer web app it's highly unlikely to have just as many people using your app at 3am as you do 3pm.

When scaling a web app, you need to prepare for peak usage, not average usage.

Here are a few first-hand examples. Friends for Sale, a social game, had ~10M daily pageviews (100 req/sec avg) but had peak load of > 200 req/sec. Ideeli, a "flash sale" designer shopping site, typically had peak load 5x-20x higher than median load.

Regardless of what Quora's traffic curve looks like, wheels' point still stand - it doesn't matter what backend you're using for 60k users on a web app.



So worst case they will be doing 4 req/s instead of 2.

I know what you're trying to say, but at Quora's numbers, as OP said, it truly is irrelevant.

No, 5-20x * 2req/s av = 10 to 40 req/s.

Peak on a decent traffic site will be roughly 2x-3x the average. 5x-20x would be a blog that gets 10 visitors a day and after being posted on news.yc gets 100 visitors.

There's an entire branch of mathematics that covers this. It's called "queuing theory". It's pretty strait forward to find the probability of N requests per second for an M/M/1 system. Then you decide, "I want to be able to handle the 99.9% case" (or whatever) and you know what your peak capacity has to be.

Traffic is bursty, not smooth.

Without estimating the traffic of the author's site, I still find his argument valid.

I've employed many of the tactics he describes and have achieved >15 M/s of dynamic content through MySQL. Not Facebook, but it was fast and scalable for the market my app was geared towards.

I don't understand all these "why we're not using NoSQL" posts. Why is it interesting that someone choose the default choice? Why would they use a NoSQL database of some sort for a major project unless they actually had a need for it? I'm half expecting a "why we don't use Brainfuck"* post one of these days...

*Not that I'm comparing any NoSQL database or NoSQL databases in general to Brainfuck, it's just that I don't see the point of "why we don't use X" posts when there was no particular reason to suppose that they might use X in the first place.

I agree, and I think it's a side effect of the polarization that's happened in the database world. Picking the term 'NoSQL' was maybe a bit aggressive, as it grouped a bunch of new ideas and projects under one banner as well as defined a clear 'opponent' to touch off the new database holy war.

Like most things, choosing a database on an ideological basis is probably not a good idea. As my grandpappy used to say "The right tool for the right job" (he didn't really say that, I'm just making things up).

I'm not 100% sure how Quora works but is this a post or an answer to a question that another person posted?

It's an answer to a question that someone else posted. I wouldn't have written it otherwise.

> 2. These distributed databases like Cassandra, MongoDB, and CouchDB[3] aren't actually very scalable or stable. Twitter apparently has been trying to move from MySQL to Cassandra for over a year. When someone reports using one of these systems as their primary data store for over 1000 machines for over a year, I'll reconsider my opinion on this.

Are there facts to back up what he says about MongoDB for example or is this just FUD.

His assertion is actually backed up by MongoDB docs: "Sharding will be production-ready in MongoDB v1.6, estimated to be released in July 2010." Their current sharding alpha is only tested/supported for small clusters (up to 20 nodes). In addition, looking at their sharding mode of operations, only a very limited set of actions are properly sharded; the rest require touching all the shards.

Disclaimer: this was the first look I've had at MongoDB. However, based on this alone, it would be prudent to wait several years before considering using this technology in serious production. It is somewhat reckless to consider a DBMS feature that wasn't locked for awhile "production-ready".

better yet, why isn't Quora using PostgreSQL? :)

I finally made the switch from MySQL to Postgres for my current product. The impetuous was licensing as I got burned by MySQL's ambiguous licensing on a prior product. Now that I have gotten over the learning curve I like Postgres about the same as MySQL. Both are good products but you can move Postgres from a Saas to an onsite deployment as needed without any licensing issues.


Can't answer for Quora, but I know why I'm not: activation energy. My team and I know MySQL pretty well, and we can stand by our decision to use it. PostgreSQL we can't.

It's not the most sound technical reasoning.

Postgres is easy to convert to from MySQL, if you keep that conversion cheatsheet handy. However, mastery of its performance quirks (all servers have quirks) can take years.

It took me 48 hours to convert to it, 10 months later and I don't feel like I know much about it.

Isn't running a system for over 10 months without being forced to learn all the innate details a plus? Gives you nice 'look at the uptime of this baby' bragging rights without too much effort ;-)

I have this question about a lot of sites. I've heard very few good things about MySQL and lots about PostgreSQL but MySQL seems to be used more and get more press.

It's just inertia ... I am familiar and worked with both ... but I still chose Mysql over Postgres simply because I have more confidence that I can control it.

That's partly because Mysql always was more user friendly, and I also attended one Mysql conference and got to know some people working on Mysql.

That said, overall Postgres is better, with one exception ... Mysql supports multiple storage engines for various needs. In Postgres you don't have that flexibility.

> That's partly because Mysql always was more user friendly

Examples? I started with Postgres and ignored MySQL from day one because of MySQL's deficiencies, and have had 0 problems learning what I needed in the years I've used it. Not to mention I can read a general SQL reference, or the standard, and the information actually applies. Unlike MySQL (quotation marks anyone?).

> That said, overall Postgres is better, with one exception ... Mysql supports multiple storage engines for various needs. In Postgres you don't have that flexibility.

"various needs": what, for those times when you don't care whether your data gets silently corrupted?

> "various needs": what, for those times when you don't care whether your data gets silently corrupted?

I never had data-corruption in Mysql. Of course, I'm always working with InnoDB, maybe MyIsam is weaker

And if Mysql would have been such a weak RDBMS, maybe it wouldn't have been used by companies like Google or Amazon (especially Amazon RDS, which are already configured Mysql instances offered to clients in Amazon's cloud).

> "various needs": what, for those times when you don't care whether your data gets silently corrupted?

Ebay is using the memory tables engine in Mysql instead of Memcached. According to their benchmarks, the Mysql client is more stable, and the combination is more scalable.

There's also a whole ecosystem of third party storage engines. Like when Adobe developed an in-house distributed storage ... they also had an (internal project) storage engine for Mysql that interacted with it.

Also, the soon to be replacement for InnoDB is Falcon ... http://forge.mysql.com/wiki/Falcon (optimized for multiple cores).

Or RethinkDB (YC startup) ... a storage engine optimized for solid-state drives ... http://rethinkdb.com/

And I don't really care about what others think ... Mysql served me well.

There are a decent number of things for which I don't care if the data gets corrupted a bit over the long run. Performance is a much bigger consideration.

PostgreSQL has better performance on real-world, high-concurrency loads over multiple cores. MyISAM is slow on tables with lots of writers (table level locking). InnoDB is better, but PostgreSQL is even better. Supposedly MySQL 5.4 is going to fix this. Move into big queries with lots of joins, subqueries, using multiple indexes, and PostgreSQL starts to shine even more.

You can even turn on async commit (since 8.3), which sends a commit acknowledgement message back to the client before the transaction is fully committed, allowing sync clients to move forward. While this does allow a very small window of time where, if the power was pulled on a machine, you'd lose just that bit of data for the tenth of a second it wasn't written to disk. All the other data is safe though, unlike MyISAM's default fsync=off, which gives it all of it's "performance." PostgreSQL also has async network I/O support baked into the libpq driver, which, if the client is written properly, makes this unnecessary. Note that async commit can be set per-connection and toggled on and off.

Virtually all of the 7.x releases were focused on durability, and 8.x releases were focused on performance, culminating with 9.0, due out soon, which will bring the much-needed readable slaves. This is the final feature that will tip the scalability gap in PostgreSQL's favor.

There are is a situation where MySQL shines: read-only MyISAM tables with simple queries where joins are not used. Virtually all of the write concurrency and data integrity features as well as PostgreSQL's performance using multiple indexes and joins become unimportant at this point. MyISAM also supports compressed read-only tables, which are great for this situation. I'd have to ask myself why I wasn't using a simpler NoSQL solution at this point though.

Yeah, MySQL could definitely be improved upon. The table-level write locking is a pain. I was just saying that 100% consistency/data integrity isn't really a major concern for a lot of applications.

I've thought about switching to Postgres, but it doesn't currently appear to be worth the pain in my case.

Locking in Innodb is row-level.

Yeah, I know, our conversation about read performance at the cost of data integrity being OK was about MyISAM (which doesn't have great concurrent write performance due to the table-level locking).

Marketing driven data. It starts already corrupted.

For any data you want to reliably read back, MySQL really has one (standard) engine†. Nearly every MyISAM DB I've used in a high traffic environment has eventually corrupted. Your choice is really either InnoDB or ‘repair table’.

† NDB may be a perfectly reliable option, I have not used it.


There's no need to burden yourself with making a system scale more than 10x further than it needs to, as long as you're confident that you'll be able to scale it as you grow.

It should be noted that the 10x quote comes from Jeff Dean. Not only does it make sense from the business point of view, it's also nearly impossible to scale a system 100x: you simply have no idea what the traffic patterns and bottlenecks will be at 100x the volume.

In other words, going from 60k uniques to 600k uniques can be possible with the same architecture. Thinking you can go 60k to 6,000,000 on the same architecture is hubris (as is planning for the 6,000,000 mark while you're at 60k).

It's also extremely annoying to see the what started as people looking into right tools to do their jobs dissolve into a hype bandwagon (but I guess it happens to any technology and I'll be a happier person if I ignore this :-)). First the hammer for all persistence was MySQL + an ORM (ignoring both other RDBMS e.g., Postgres, approaches such as BerkeleyDB), then it was CouchDB now it seems to be Mongo or Cassandra. These are entirely different systems, each has a use case where it shines. Why are people all the sudden demanding that others use a "NoSQL solution"?

That being said, MySQL isn't that hot in terms of reliability (of course starting off with Oracle is recipe for some technical debt and vendor lock-in). There's no substitute for proper backups and administration. This is more of an argument for MySQL (or my personal favourite, Postgres) than against it: with MySQL the backup and administration solutions are very well known.

Some of the points mentioned in the article in regards to distributed systems are also fairly confusing and incorrect. CouchDB and Mongo are (scalability-wise) essentially the same as MySQL: "potentially consistent" replication (no ability to perform quorum reads, no guarantee of eventual consistency, no version vectors) and not truly distributed (a 1000 cluster of consisting of 500 "shards" two replicas each is really 500 separate clusters). MySQL already has very relaxed semantics (a master-slave replication system makes absolutely _no_ guarantees about commits made to master being consistently read from the slaves or even being persisted on the slave if the master crashes).

Dynamo and BigTable inspired systems are radically different. However, as with any other system under active development (which also includes RDBMS such as VoltDB), unless you're ready to ask a committer on one of these systems to institute a special QA cycle and support for your use case (or you are a committer on one of the systems, in which case you know what you're doing) when using one of these systems as a primary datastore, you're taking on a technical risk. A technical risk that's not needed when you need to persist data for 60k monthly uniques: I can do that just fine with any system. Depending on the use patterns of the application, you can also do for 60,000,000 monthly uniques while sticking with MySQL (while most other parts of the applications, including the data access layer to MySQL, would have to be re-written many times over).

In the end Adam's decision is probably the correct decision (I don't work for Quora, so I can't tell), but the technical reasons sound slightly ignorant. It's perfectly fine to be ignorant of these topics, it's not, however, okay to speak authoritatively on a topic you're ignorant of (e.g., I know nothing of machine learning, so I am not going to speculate on how Quora should be using machine learning to make their homepage more relevant -- which they probably are -- I'd add nothing to the conversation).

Jeff Dean (for those wondering): http://research.google.com/people/jeff/index.html

> Schemas allow the data to persist in a typed manner across lots of new versions of the application as it's developed, they serve as documentation, and prevent a lot of bugs.

So, schemaless vs. schemaful is similar to static typed vs. dynamic typed? (or is it more like typed vs. untyped?)

> I think the "NoSQL" fad will end when someone finally implements a distributed relational database with relaxed semantics.

What does he mean by "relaxed semantics"?

Also interesting is citing Facebooks giant usage of MySQL (1800 servers, 2 DBAs) but ignoring that they advocate pretty strongly for schemaless structures.

The built-in MySQL schemas have problems - mostly that schema changes can take hours or days on large tables. As a result of this you need to build a system that doesn't require schema changes at the MySQL level. This doesn't mean you can't use schemas at a higher level though.

There are two ways to build on top of MySQL - you can either make a flexible schemaless structure with objects and properties and associations between them, which is what Facebook advocates for, or you can make a flexible virtual schema that has column names, types, and arbitrary indexes. I'm in favor of the latter.



I'm guessing downmodders aren't aware of this article? http://queue.acm.org/detail.cfm?id=1394128

Isn't it possible to configure MySQL server and data to run in memory if we've enough RAM?

Yep, but I believe you have to handle persisting to disk yourself (there might be a Redis-like solution to this that I'm not aware of, though).

What does Redis do?

It's an in-memory hash store that is persisted to the disk when possible. Sort of like memcached that can survive restarts and is suitable for long-term storage.

Well, I know that much. Did you mean that Redis just persists when you ask it to?

(I was counting that under "you have to do it yourself" and thought it might do something more automatic but still infrequent.)

Ah OK. I believe redis persists automatically when it can, from the memory, though I haven't actually used it. With a MySQL memory table, if you want to write to the memory table AND have it persist, you have to keep track of what has changed yourself, and handle writing those things to disk.

Indeed, and it's quite simple to do this on a table-by-table basis.

Yes, construct your tables with the MEMORY storage engine.

it's amusing that the article now errors out...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact