
MySQL is a Better NoSQL - yoava
http://engineering.wix.com/2015/12/10/scaling-to-100m-mysql-is-a-better-nosql/
======
JonahBraun
Interesting that he advocates MySQL. PostgreSQL can index and interact with
the JSON stored in it's columns, which make it a better choice as a NoSQL
replacement.

[http://www.postgresql.org/docs/9.4/static/datatype-
json.html](http://www.postgresql.org/docs/9.4/static/datatype-json.html)

~~~
javajosh
The JSON datatype is only useful if you want to query on the blob, which is
not the case in Wix's case.

~~~
staz
But you tell to keep columns that need indexs outside of the JSON so you can
query on them, which is not necessary with Postgres

------
hoodoof
I did an extensive survey of NoSQL databases recently and for my particular
requirements Postgres ended up being the best choice. I examined MySQL too in
this particular survey. Can't remember why but its JSON handling didn't meet
my needs.

Every time I consider alternative databases -and I've done so several times in
the past ten years - I do a thorough examination and the answer always seems
to come back to Postgres. I'm not a particular Postgres fanboy but the answer
just ends up at the same place.

~~~
namelezz
When and how did you evaluate PostgreSQL? Can you comment on PostgreSQL's
horizontal scale?

~~~
hoodoof
I cannot comment on its horizontal scaling.

I can say however that Postgres, from what I understand, has nearly linear SMP
vertical scaling up to 64 cores. There's an awful lot of headroom there. I
read a long while back Jeff Atwood saying that StackOverflow ran on a single
vertically scaled database server with MSSQL for a long time.

If I needed to go beyond vertical then I'd maybe horizontally scale in a
custom manner based around the application data access characteristics, which
plays a huge role in designing sensible horizontal scaling solutions.

~~~
namelezz
Thank you for your response. I was asking the questions because when I was
evaluating MySQL and PostgreSQL 3 years ago I ended up choosing MySQL due to
its community and the rumor about difficulties in horizontal scaling of
PostgreSQL even though I really liked PostgreSQL JSON support. I hear
PostgreSQL is getting better at scaling horizontally although I do not know
how good it is now.

~~~
hoodoof
Horizontal scaling comes with a who range of tradeoffs that you need to make
and exactly how you horizontally scale I think has alot to do with how your
application stores, reads and writes data.

------
jhall1468
I'm going to question your knowledge of the domain immediately when your
solution to a problem is to take away all of the advantages of SQL without any
of the benefits that Cassandra or Redis provide.

To me this is akin to saying we don't need Haskell because Java now has lambda
expressions.

~~~
yoava
What are the benefits of Cassandra or Redis that I am missing, for this use
case? We have the latency figures as good as Cassandra can get and we get a
reliable engine to store data (something that Redis is not - read Aphir post
about Redis)

The main advantage of MySQL that we keep is the rock solid platform with all
the know-how to operate and manage.

~~~
jhall1468
For one, Cassandra's read/write throughput scales linearly as you add nodes.

I would agree that your use case is narrow, and as such MySQL _works_. But
that doesn't mean it's ideal, and if your use case changes is objectively
worse.

Most importantly you made a claim: "MySQL is a Better NoSQL". Where's the data
for your claim? You found a niche use-case where using MySQL as a KV store
_works_ but then claimed it's better.

If I see a nail, and you tell me whacking it with a crowbar is better than a
hammer, you need to show me why. All you gave me was a schematic on how to use
a crowbar as a hammer.

------
DigitalSea
MySQL is great and all, but if you want a better database that is considerably
more powerful and supports NoSQL type scenarios: PostgreSQL. The fact that
Postgres has supported working with JSON like key/value store data for a while
now and it is quite decent, as well as giving you other options if you decide
you want the traditional approach of tables.

~~~
yoava
When we started with this use case, at 2008, PostgreSQL did not look so good.
Today, to be honest, I'd probably try it for new projects, except for the vast
experience with MySQL that we have at Wix

~~~
hitekker
If interested, You may wish to make a first level comment so people can reply
and ask questions about your blog post.

------
cortesoft
I really hate any post that says "you don't really need this tool, this one
does that job just fine for us"

You don't know my workload. You don't know my requirements. It is ok to make a
post saying that a specific tool is wrong for these use cases, or that you
believe that many people using a tool don't need it or would be better off
with a different tool..... But don't act like you know my workload.

~~~
parent5446
Did you read the article? It isn't saying that at all.

Even the introduction clearly explains that it is addressing a trend of
developers using NoSQL because of hype rather than actually evaluating their
use cases, and that the remainder of the article is how Wix found MySQL better
for their specific scenario. They even give tips on when to know if MySQL is
good for you for this use case.

It does not attempt to make sweeping statements about NoSQL or MySQL, nor does
it prescribe a solution for every workload.

~~~
jhall1468
> It does not attempt to make sweeping statements about NoSQL or MySQL, nor
> does it prescribe a solution for every workload.

The title is _literally_ a sweeping statement that MySQL is better. The first
sentence in the introduction implies that the key-value store is an example of
when _MySQL is better_.

Title: _Scaling to 100M: MySQL is a Better NoSQL_

First Sentence: _MySQL is a better NoSQL. When considering a NoSQL use case,
such as key /value storage, MySQL makes more sense in terms of performance,
ease of use, and stability._

That's nothing if it isn't a sweeping statement.

~~~
parent5446
This is precisely why I asked if you read the article. But it seems you only
read the title and the first sentence.

~~~
jhall1468
I'm not who you responded to, but for the record I read the entire article.
The reason I pointed out the title and first sentence, is because those were
where the op made sweeping statements, which you claimed he didn't.

A little defensive aren't we?

------
tylermauthe
_> Fields only exist to be indexed. If a field is not needed for an index,
store it in one blob/text field (such as JSON or XML)._

A cautionary tale... It can be tempting to store serialized objects in your
DB. This can be very dangerous, both for debugging and for migrations. Your
data is coupled to your application in subtle ways and you can get stale
references if you store any relationships in the "blob". JSON or XML isn't so
terrible, at least a human can read that -- using a true BLOB is a kiss of
death. A database is a handy abstraction for storing data, I highly recommend
that you use this abstraction (and all abstractions) for humans first and
robots second.

Granted, this is an article about optimization and in that regard I'm sure it
yields benefits. If you must squeeze every last drop of juice out of a MySQL
server, then do this -- just know that it comes with a price.

------
halayli
> "MongoDB, Cassandra, Redis, or Hadoop"

Adding Hadoop in the mix tells me he doesn't know what a NoSQL is.

~~~
p4wnc6
I'm not sure that I understand your point. It seems very reasonable to me to
consider a raw HDFS as a type of NoSQL data storage mechanism, with other
components of the Hadoop framework acting on top of it as query languages and
other utilities.

Of course we can bikeshed all day and split hairs about the precise
definitions, but in the spirit of pragmatism, I would absolutely put "Hadoop"
as a form of NoSQL storage solution. In fact, I'd say the most distinct of the
items mentioned is Redis, and Redis is the tool from this list that has use
cases that tend to differ the most significantly from traditional RDBMSs (I'm
thinking specifically cases where Redis is used as a buffering layer to help
relieve congestion for a highly visited base data store). But even so, I still
don't think it represents any sort of taxonomic error to put them all in the
NoSQL category, at least for practical purposes.

~~~
halayli
hadoop != HDFS is my point.

------
chad_walters
MySQL may work well for this small data set (200GB). Start working with 10s of
TBs of data and you will start to understand why NoSQL stores were built.

~~~
austinsharp
My thought exactly. This is just a scale that can be solved either way; when
you really can't fit your data on even a handful of machines with acceptable
performance, then Cassandra can start to shine.

------
avitzurel
This is right and wrong in the same time. Not sure which more.

Storing JSON as TEXT is great, but you really only query the data based on the
mySQL index.

What if you need to query based on the site_data?

This is really not NoSQL this is just a key value store with a JSON object
that is not even native type to the database, you will still need to parse
back/forth.

What about updating the data? You need to get the object and then you need to
parse it, change it and send it back marshalled to TEXT that MySQL can
understand.

Even as a KV store, using MySQL simplifies a lot of things but there are MUCH
better solutions out there.

[EDIT] One other thing that this fails to mention is that TEXT in MySQL will
in most cases go to disk to get the "document" which will be slower.

In Wix case, getting a website and having all the data available this is most
likely not relevant, but if you want to get multiple rows at a time, this will
become a consideration for you.

~~~
ngrilly
Use the last version of MySQL 5.7 that supports JSON natively:

[http://dev.mysql.com/doc/refman/5.7/en/json.html](http://dev.mysql.com/doc/refman/5.7/en/json.html)

------
aidos
Quick question - I'm confused about the comment "The nested query syntax
ensures that we are doing only one round-trip to the database to run both SQL
queries". Why is the nested syntax better than a regular join in this case?

Seems like a join would allow the query planner to decide how to carry out the
query.

~~~
meritt
Author seems to be confused on how databases operate (especially since they
really should just be using a KV store like Redis), because you're exactly
right, the query would be the same as:

    
    
        select sites.* from sites join routes using (site_id) where route_id = ?
    

Assuming sites.site_id and routes.route_id are both primary keys, this query
is going to perform identically using either syntax. It'll read 1 row from
each table.

They could see a further performance gain by placing an index on routes
(route_id, site_id) since the site_id could be retrieved from the b-tree and
avoid a table hit entirely. But regardless, query syntax will not affect
performance here.

~~~
oysteing
You are right that this join query should be equivalent. However, with InnoDB
there will not be much benefit from your suggested index. In InnoDB primary
indexes are clustered. Hence, any column may be retrieved from the B-tree of
the primary index.

~~~
jfrisby
Only if the columns are stored on-page with the index, which is not guaranteed
-- it depends on what other columns exist in the table, etc.

------
trustfundbaby
Title feels a bit faceitous because the article's use case is more akin to a
key value store, vs nosql in the schemaless, mongodb/cassandra sense.

Good write up though.

------
weddpros
If sharding, rebalancing and partitions didn't exist... And if 1600 rps were
enough when nosql can reach 1M rps (600 times more)... And if that homemade
replication infrastructure was as mature and supported as nosql dbs...

Then using MySQL as a nosql might work.

~~~
lotyrin
This comparison is a dangerous one without some kind of qualification. This
performance difference is not inherent to the design of these systems like you
seem to imply.

What use case and hardware are your hypothetical relational/non-relational
database under where you get your 600 times speed-ups?

I can run a benchmark of a few hundred fast machines with sharded sqlite
databases doing key-value and operating in RAM and get large numbers (any
number I want), but they don't mean anything.

~~~
weddpros
I brought the performance argument (among others) because the article does.
When I read 100k I thought it was rps, not rpm. It would have been an
achievement.

The achievable performance of dbs is linked to their horizontal scalability,
and SQL does not scale horizontally because of the relational model. Subsets
of SQL can be made to scale horizontally, like in Cassandra, without locks,
transactions, joins, aggregations, etc... or with a homemade cluster like the
one described in the article.

But it's not easy to shard a mysql or sqlite system. It's very hard to
rebalance a cluster. It's very hard to make it work during partitions. Paxos
or Raft are difficult. Most nosql dbs do that for you. MySQL doesn't.

So the argument of a "battle tested db" is weak if a mysql system must use
custom built cluster management (custom built != battle tested)

Online games are a good use case where you need very high "50% write/50% read"
throughput, but it's only one among others. Logging/timeseries is another use
case, with 99% write.

As for the sqlite cluster: will you implement all the
sharding/rebalancing/partition tolerance/CP database too? If not, you're
comparing apples and oranges again.

MySQL has merits, but nosql dbs have different design goals: judging by the
"battle tested" argument totally missed the point.

BTW, even a modest (2dbs + 1arbiter) mongodb cluster would handle 1600rps
easily, and you'd get automatic failover and replication for free, with
sharding baked-in if you need it tomorrow.

~~~
lotyrin
That's the thing though, I _can_ do 100k transactions per second with an
RDBMS, it's not an achievement.

Whether a DBMS allows relations or not or uses SQL or not is an independent
property of whether it has built in dynamic
schema/graph/replication/failover/query
distribution/sharding/rebalancing/distributed consistency solutions. And
whether those solutions are built-in has little to do with whether solving
them is possible.

If it's just that we are disappointed that the traditional database systems
(which happen to be relational and SQL because these are elements of that
tradition) feel like they don't need to solve these problems, then I
absolutely agree.

The performance benefits of flattening your data model, avoiding indexing
things you don't query against, sharding, using a distributed map reduce, etc.
can all be had with SQL and an RDBMS, so speed is a poor argument for straying
from the tradition, IMO.

If it's really about wanting a new generation of comprehensive platforms for
solving distributed data management (all those features above) that's a great
argument, but somehow it's always about speed. Of course distributing
asynchronous writes across a cluster of cheap cloud VMs starved for disk IO is
faster than single-point synchronous writes on one of those VMs, but is the
operations overhead of orchestrating and monitoring that cluster really
cheaper than provisioning the hardware it'd take to do the same throughput
with a simpler traditional system? Not as often as I'd be had to believe.

------
mbfg
This use case seems to be highly read oriented, which is fine, but really
diminishes the value of distributed 'nosql' stores. The really hard thing to
scale is massive writes. Cassandra's big differentiator, for instance, is it's
ability to horizontally scale massively when it comes to writes. a million
client writes per second, replicated over zones, type of performance.

Oh and by the way, 'vertical' scaling is kind of a silly concept. That just
means, "i wasn't an idiot".

------
seanwoods
Reminds me of:

[http://backchannel.org/blog/friendfeed-schemaless-
mysql](http://backchannel.org/blog/friendfeed-schemaless-mysql)

------
lemmsjid
As this title demonstrates, NoSQL is no longer a useful term. (It was useful
for a period of time when SQL-based RDBMs systems were quite predominant, so
it could be used as a gross differentiator).

Now the best way to think about it is that there are database platforms with
varying features, one of which is support for SQL. When you evaluate which
platform to use, you should have a list of business-derived criteria, such as
SQL support, support for various relational integrity constructs (i.e. foreign
keys), latency for typical queries (e.g. Hive), fault tolerance, ACID
compliance, partitioning schemes, read or write optimization, etc. and act
accordingly.

"NoSQL" used to be shorthand for a vague subset of database features that
usually involved relaxing data protection in favor of multi-server
scalability, but now it just muddies the waters, especially as many NoSQL
platforms now support SQL or a subset thereof.

~~~
yoava
Big thumb up!!!

------
rakoo
Might as well go the full route:

    
    
      CREATE TABLE `sites` (
        `site_id` varchar(50) NOT NULL,
        `owner_id` varchar(50) NOT NULL,
        `schema_version` varchar(10) NOT NULL DEFAULT '1.0',
        `site_data` text NOT NULL,
        `last_update_date` bigint NOT NULL,
        `route` varchar(255) NOT NULL,
        PRIMARY KEY (`site_id`)
      ) /*ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=16*/;
    
      select * from sites where route = ?
    

No join (truly, this time), no need for transaction (whether in the DB or
softly in the application). If you want improvements in performance you need
to stop using your DB as a way to model your _data_ and start using it as a
way to model your _queries_. Considering that the _data_ is already stored in
the site_data field, they already started on this path.

------
carsongross
This statement is nearly meaningless without context.

YesSQL is better than NoSQL _for some problems_ (I happen to think many of
them.)

NoSQL is better than YesSQL _for some problems_ (I happen to think more of
them than I thought a few years ago.)

I'm glad that YesSQL worked well for Wix in this case.

------
niutech
This was done by FriendFeed back in 2009:
[https://backchannel.org/blog/friendfeed-schemaless-
mysql](https://backchannel.org/blog/friendfeed-schemaless-mysql)

------
mkadlec
What about ease of setup? I'd take the NoSql setup any day. If in need of a
super light, super quick db that is for a "non-enterprise" application, it's
tough to beat NoSql...

~~~
yoava
Which NoSQL are you referring too? Most are not that easy to implement (saying
that while managing a few PBs of data in production with a few such tools)

------
dirkdk
that mapping from routes to sites seems also great to be stored in a Memcached
machine, as it is probably set once and then stays the same for months if not
years. Memcached and MySQL is always a great combo

~~~
yoava
True. Memcached is a great tool and great for this use case. But when you get
read latency of ~0.3 mSec from MySQL, why add another hop and another engine
to the mix?

------
nfirvine
"Do not perform table alter commands. Table alter commands introduce locks and
downtimes. Instead, use live migrations."

You know what avoids alter? Going schemaless.

~~~
sitharus
Or using a more appropriate RDBMS that allows for online and transactional
ALTER commands?

------
sagivo
Great post Yoav! Thanks for sharing

------
jonthepirate
+1 for "do not use joins"

~~~
jondumbau
why

~~~
jerrysievert
expense - see [http://stackoverflow.com/questions/2065754/is-there-any-
gene...](http://stackoverflow.com/questions/2065754/is-there-any-general-rule-
on-sql-query-complexity-vs-performance) which actually shows it fairly well.

(the stackoverflow response is a much better response than i could have
written, hence the link)

~~~
lokedhs
Except that he did use a join in the article itself (formatted as a subquery,
but still a join)

