
Why Quora uses MySQL rather than No-SQL - bokonist
http://www.quora.com/Why-does-Quora-use-MySQL-as-the-data-store-rather-than-NoSQLs-such-as-Cassandra-MongoDB-CouchDB-etc.
======
fookyong
I find this answer incredibly excellent.

particularly:

 _You can actually get pretty far on a single MySQL database and not even have
to worry about partitioning at the application level. You can "scale up" to a
machine with lots of cores and tons of ram, plus a replica. If you have a
layer of memcached servers in front of the databases (which are easy to scale
out) then the database basically only has to worry about writes. You can also
use S3 or some other distributed hash table to take the largest objects out of
rows in the database. There's no need to burden yourself with making a system
scale more than 10x further than it needs to, as long as you're confident that
you'll be able to scale it as you grow._

I've dealt with some insanely convoluted legacy systems that were made up of a
fragile bunch of badly configured clusters ("for scalability!") that I
improved the speed and load beariness by 100000x simply by putting it on a
single server, tons of ram, and combing through the slow query log.

~~~
rbranson
Sounds like pre-mature optimization, a bad idea which hackers need to be
constantly reminded of.

~~~
fookyong
yes that's exactly what it was.

it was outsourced to a vendor that pretty much took advantage of the
contracting startup's lack of DB knowledge. so they got the kitchen sink and a
fat invoice.

of course it's easy for the vendor to defend this - "what, we want you to be
scalable!".

~~~
jseifer
Flip side: How many startups ask for a solution that will scale to millions of
users per day once their app goes live?

------
wheels
It's a site with in the ballpark of 60k uniques per month. What they use is
pretty irrelevant. Really all this reduces to is, "it's dumb to optimize for
problems you don't have."

Assuming 100 requests per user per month, 60k uniques is 2 requests per
second. They could run the whole site on one Linode unless they've got a
particularly bad architecture.

~~~
teej
> 2 requests per second

Your point is valid, but it doesn't work like that.

~~~
vog
Do you care to elaborate?

~~~
teej
Taking daily pageviews and dividing it by 24 * 60 * 60 assumes your traffic
curve isn't a curve at all - it's flat. For your average US-based consumer web
app it's highly unlikely to have just as many people using your app at 3am as
you do 3pm.

When scaling a web app, you need to prepare for peak usage, not average usage.

Here are a few first-hand examples. Friends for Sale, a social game, had ~10M
daily pageviews (100 req/sec avg) but had peak load of > 200 req/sec. Ideeli,
a "flash sale" designer shopping site, typically had peak load 5x-20x higher
than median load.

Regardless of what Quora's traffic curve looks like, wheels' point still stand
- it doesn't matter what backend you're using for 60k users on a web app.

\--------------

[http://highscalability.com/friends-sale-
architecture-300-mil...](http://highscalability.com/friends-sale-
architecture-300-million-page-view-month-facebook-ror-app)

~~~
zepolen
So worst case they will be doing 4 req/s instead of 2.

I know what you're trying to say, but at Quora's numbers, as OP said, it truly
is irrelevant.

~~~
Retric
No, 5-20x * 2req/s av = 10 to 40 req/s.

~~~
zepolen
Peak on a decent traffic site will be roughly 2x-3x the average. 5x-20x would
be a blog that gets 10 visitors a day and after being posted on news.yc gets
100 visitors.

------
almost
I don't understand all these "why we're not using NoSQL" posts. Why is it
interesting that someone choose the default choice? Why would they use a NoSQL
database of some sort for a major project unless they actually had a need for
it? I'm half expecting a "why we don't use Brainfuck"* post one of these
days...

*Not that I'm comparing any NoSQL database or NoSQL databases in general to Brainfuck, it's just that I don't see the point of "why we don't use X" posts when there was no particular reason to suppose that they might use X in the first place.

~~~
brown9-2
I'm not 100% sure how Quora works but is this a post or an answer to a
question that another person posted?

~~~
dangelo
It's an answer to a question that someone else posted. I wouldn't have written
it otherwise.

------
weixiyen
> 2\. These distributed databases like Cassandra, MongoDB, and CouchDB[3]
> aren't actually very scalable or stable. Twitter apparently has been trying
> to move from MySQL to Cassandra for over a year. When someone reports using
> one of these systems as their primary data store for over 1000 machines for
> over a year, I'll reconsider my opinion on this.

Are there facts to back up what he says about MongoDB for example or is this
just FUD.

~~~
villiros
His assertion is actually backed up by MongoDB docs: "Sharding will be
production-ready in MongoDB v1.6, estimated to be released in July 2010."
Their current sharding alpha is only tested/supported for small clusters (up
to 20 nodes). In addition, looking at their sharding mode of operations, only
a very limited set of actions are properly sharded; the rest require touching
all the shards.

Disclaimer: this was the first look I've had at MongoDB. However, based on
this alone, it would be prudent to wait several years before considering using
this technology in serious production. It is somewhat reckless to consider a
DBMS feature that wasn't locked for awhile "production-ready".

------
rbranson
better yet, why isn't Quora using PostgreSQL? :)

~~~
generalk
Can't answer for Quora, but I know why I'm not: activation energy. My team and
I know MySQL pretty well, and we can stand by our decision to use it.
PostgreSQL we can't.

It's not the most sound technical reasoning.

~~~
mahmud
Postgres is easy to convert to from MySQL, if you keep that conversion
cheatsheet handy. However, mastery of its performance quirks (all servers have
quirks) can take years.

It took me 48 hours to convert to it, 10 months later and I don't feel like I
know much about it.

~~~
c1sc0
Isn't running a system for over 10 months without being forced to learn all
the innate details a plus? Gives you nice 'look at the uptime of this baby'
bragging rights without too much effort ;-)

------
strlen
_There's no need to burden yourself with making a system scale more than 10x
further than it needs to, as long as you're confident that you'll be able to
scale it as you grow._

It should be noted that the 10x quote comes from Jeff Dean. Not only does it
make sense from the business point of view, it's also nearly impossible to
scale a system 100x: you simply have no idea what the traffic patterns and
bottlenecks will be at 100x the volume.

In other words, going from 60k uniques to 600k uniques can be possible with
the same architecture. Thinking you can go 60k to 6,000,000 on the same
architecture is hubris (as is planning for the 6,000,000 mark while you're at
60k).

It's also extremely annoying to see the what started as people looking into
right tools to do their jobs dissolve into a hype bandwagon (but I guess it
happens to any technology and I'll be a happier person if I ignore this :-)).
First the hammer for all persistence was MySQL + an ORM (ignoring both other
RDBMS e.g., Postgres, approaches such as BerkeleyDB), then it was CouchDB now
it seems to be Mongo or Cassandra. These are entirely different systems, each
has a use case where it shines. Why are people all the sudden demanding that
others use a "NoSQL solution"?

That being said, MySQL isn't that hot in terms of reliability (of course
starting off with Oracle is recipe for some technical debt and vendor lock-
in). There's no substitute for proper backups and administration. This is more
of an argument for MySQL (or my personal favourite, Postgres) than against it:
with MySQL the backup and administration solutions are very well known.

Some of the points mentioned in the article in regards to distributed systems
are also fairly confusing and incorrect. CouchDB and Mongo are (scalability-
wise) essentially the same as MySQL: "potentially consistent" replication (no
ability to perform quorum reads, no guarantee of eventual consistency, no
version vectors) and not truly distributed (a 1000 cluster of consisting of
500 "shards" two replicas each is really 500 separate clusters). MySQL already
has very relaxed semantics (a master-slave replication system makes absolutely
_no_ guarantees about commits made to master being consistently read from the
slaves or even being persisted on the slave if the master crashes).

Dynamo and BigTable inspired systems are radically different. However, as with
any other system under active development (which also includes RDBMS such as
VoltDB), unless you're ready to ask a committer on one of these systems to
institute a special QA cycle and support for your use case (or you _are_ a
committer on one of the systems, in which case you know what you're doing)
when using one of these systems as a primary datastore, you're taking on a
technical risk. A technical risk that's _not_ needed when you need to persist
data for 60k monthly uniques: I can do that just fine with any system.
Depending on the use patterns of the application, you can also do for
60,000,000 monthly uniques while sticking with MySQL (while most _other_ parts
of the applications, including the data access layer to MySQL, would have to
be re-written many times over).

In the end Adam's decision is probably the correct decision (I don't work for
Quora, so I can't tell), but the technical reasons sound slightly ignorant.
It's perfectly fine to be ignorant of these topics, it's not, however, okay to
speak authoritatively on a topic you're ignorant of (e.g., I know nothing of
machine learning, so I am not going to speculate on how Quora should be using
machine learning to make their homepage more relevant -- which they probably
are -- I'd add nothing to the conversation).

~~~
mrduncan
Jeff Dean (for those wondering):
<http://research.google.com/people/jeff/index.html>

------
10ren
> Schemas allow the data to persist in a typed manner across lots of new
> versions of the application as it's developed, they serve as documentation,
> and prevent a lot of bugs.

So, schemaless vs. schemaful is similar to static typed vs. dynamic typed? (or
is it more like typed vs. untyped?)

> I think the "NoSQL" fad will end when someone finally implements a
> distributed relational database with relaxed semantics.

What does he mean by "relaxed semantics"?

~~~
brown9-2
Also interesting is citing Facebooks giant usage of MySQL (1800 servers, 2
DBAs) but ignoring that they advocate pretty strongly for schemaless
structures.

~~~
dangelo
The built-in MySQL schemas have problems - mostly that schema changes can take
hours or days on large tables. As a result of this you need to build a system
that doesn't require schema changes at the MySQL level. This doesn't mean you
can't use schemas at a higher level though.

There are two ways to build on top of MySQL - you can either make a flexible
schemaless structure with objects and properties and associations between
them, which is what Facebook advocates for, or you can make a flexible virtual
schema that has column names, types, and arbitrary indexes. I'm in favor of
the latter.

------
known
Isn't it possible to configure MySQL server and data to run in memory if we've
enough RAM?

~~~
ericd
Yep, but I believe you have to handle persisting to disk yourself (there might
be a Redis-like solution to this that I'm not aware of, though).

~~~
astrange
What does Redis do?

~~~
ericd
It's an in-memory hash store that is persisted to the disk when possible. Sort
of like memcached that can survive restarts and is suitable for long-term
storage.

~~~
astrange
Well, I know that much. Did you mean that Redis just persists when you ask it
to?

(I was counting that under "you have to do it yourself" and thought it might
do something more automatic but still infrequent.)

~~~
ericd
Ah OK. I believe redis persists automatically when it can, from the memory,
though I haven't actually used it. With a MySQL memory table, if you want to
write to the memory table AND have it persist, you have to keep track of what
has changed yourself, and handle writing those things to disk.

------
ianlevesque
it's amusing that the article now errors out...

