

Reading Between the Benchmarks: How MemSQL Designs for Speed and Durability - nikita
http://developers.memsql.com/blog/reading-between-the-benchmarks/

======
dhruvbird
A few questions Ankur:

1\. How many rows were actually present in the tables (in all)?

2\. What did the VSZ and RSS for mysqld and memsqld(?) look like at peak and
after benchmarking was completed?

3\. What did the CPU usage of each mysqld and memsqld(?) look like at peak and
what was the total amount of processor time used by each database?

4\. As a percentage, how many statements were inserts and how many were
queries? I'm guessing I might be able to get an idea by looking at the code,
but it's easier to ask - esp. since you'll know more about the code.

5\. AUTOCOMMIT ON or OFF? (dunno what the defaults are).

~~~
ankrgyl
Hey Dhruv,

In the case of the 24 core machine...

The database is initialised with 560,000 players. For every action an event is
written to the database, so at the end of the 10 minute run there are ~aps *
600 event rows: 1.2 million for MySQL and 40 million for MemSQL. Then there
are the active games: MySQL ends with a million active games, and MemSQL ends
with the steady state of 5.6 million games. The total number of rows at the
end is around 2.8 million for MySQL and around 48 million for MemSQL.

MySQL VSZ ends and peaks at 6 GB, RSS ends and peaks at 2 GB. For MemSQL, the
RSS and VSZ end and peak at 18 GB and 19 GB, respectively.

MySQL keeps each of the 24 cores at 25% utilization. MemSQL runs with all
cores at 100%. On the 8 core machine, MySQL keeps each core 80% utilized.

The number of inserts and queries is easier to explain by looking at
<https://github.com/memsql/bench/blob/master/sql_database.py>, starting at
get_games and ending at log. Every action is composed of one call to
get_games, one call to either start_game, make_move or win_game, and one call
to log. On average, each action has two selects and between two and three
writes (with one update).
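As a sketch of that shape (the function names come from the repo, but the stub database and branch choices below are assumptions for illustration, not the actual benchmark code):

```python
import random

class StubDB:
    """Stand-in for the sql_database.py wrapper; only records calls."""
    def __init__(self):
        self.calls = []
    def get_games(self, player_id):      # one SELECT
        self.calls.append("get_games")
        return [101, 102]
    def start_game(self, player_id):     # INSERT a new game
        self.calls.append("start_game")
    def make_move(self, game_id):        # UPDATE game state
        self.calls.append("make_move")
    def win_game(self, game_id):         # finish a game
        self.calls.append("win_game")
    def log(self, player_id):            # INSERT an event row
        self.calls.append("log")

def perform_action(db, player_id):
    """One action: get_games, then exactly one game op, then log."""
    games = db.get_games(player_id)
    if not games:
        db.start_game(player_id)
    else:
        game = random.choice(games)
        if random.random() < 0.5:
            db.make_move(game)
        else:
            db.win_game(game)
    db.log(player_id)
```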

Because client code is shared between MemSQL and MySQL, we could not use
multistatement transactions, and every statement was executed in its own
transaction.

All of the code for the benchmark can be found at github.com/memsql/bench.

~~~
timaelliott
@ankrgyl

> every statement was executed in its own transaction.

I do not see any begin(), rollback() or commit()'s in your code. According to
the mysql-python docs, [http://mysql-python.sourceforge.net/FAQ.html#my-data-
disappe...](http://mysql-python.sourceforge.net/FAQ.html#my-data-disappeared-
or-won-t-go-away), autocommit is disabled by default. Therefore each of your
workers appears to be operating inside one very large transaction and never
actually committing anything to the database?

Can you clarify?

~~~
ankrgyl
Referring to <https://github.com/memsql/bench/blob/master/sql_database.py>,
we use _mysql, not MySQLdb. Here's some code that demonstrates:

        In [1]: import _mysql
    
        In [2]: c = _mysql.connect(host='127.0.0.1', user='root')
    
        In [3]: c.query("SELECT @@session.autocommit")
    
        In [4]: c.store_result().fetch_row(1)
        Out[4]: (('1',),)

------
halayli
I said this before and I'll say it again. Any benchmark that doesn't state why
one candidate is faster than the other is not a benchmark. It's bullshit
marketing material.

A benchmark that doesn't state what one candidate bottlenecks on and how the
faster candidate avoids the bottleneck (and at what cost) is really worth
nothing. It means whoever ran the results doesn't understand what they are
testing.

------
cpeterso
How does MemSQL's design differ from other in-memory relational databases,
such as Oracle's "TimesTen"?

<https://en.wikipedia.org/wiki/TimesTen>

~~~
nikita
MemSQL uses multiversion concurrency control (MVCC), lockfree data structures
and code generation. We are also building modern features like sharding.

TimesTen uses row locks, doesn't compile queries, and doesn't work on a cluster.
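This is not MemSQL's actual implementation, but the MVCC idea can be sketched with a toy versioned cell: writers append new versions, and a reader pins a snapshot version and keeps seeing it, so readers never block writers (all names below are illustrative):

```python
import itertools

class VersionedCell:
    """Toy MVCC: writes append new versions; a read against an older
    snapshot still sees the value that was current at snapshot time."""
    def __init__(self, value):
        self._clock = itertools.count(1)
        self._versions = [(0, value)]   # (version, value), ascending

    def write(self, value):
        # Writers never overwrite in place; they append a new version.
        self._versions.append((next(self._clock), value))

    def snapshot(self):
        # A reader pins the current version number.
        return self._versions[-1][0]

    def read(self, snapshot):
        # Return the newest value visible at the pinned snapshot.
        for version, value in reversed(self._versions):
            if version <= snapshot:
                return value
```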

~~~
onetwothreefour
Just FYI, any time you guys mention code generation, you sound clueless about
anything related to databases.

This stuff has been done for eternity, it's just not necessary these days.
Also, MemSQL doesn't work on a cluster yet, so that's not a feature _against_
TimesTen.

Sigh.

------
jaequery
With the rapid advancements of SSD and RAM-drives, I wonder how that will
shape the future of databases.

I'm guessing we won't need solutions like this anymore.

~~~
heretohelp
That's what the RethinkDB people have purportedly been working on.

~~~
ddorian43
Their latest blogpost is from over a year ago. Last I saw, they were trying to
make memcache/redis on SSD. Any update or inside info?

~~~
heretohelp
'Fraid not. I'd talked to Slava back when I lived in New York, but that didn't
grant me any information you can't find on the website. I really only talked
to him because I was looking for work at the time and I'm a database nut.

I don't know what they've been up to since.

------
executive
more hand waving nonsense. so how many times faster is it this time?

------
orangethirty
So what is this new DB for? General use?

~~~
nikita
\- OLTP: lots of small high throughput transactions

\- Real time analytics (you ingest data fast and query it at the same time)

More use cases are going to open up as we build more features.

------
davvid
Are there any benchmarks against postgres? One of the noted features is
concurrency; postgres scales more linearly than mysql when you add more cores
so I'd be curious to see a comparison.

------
nodesocket
Was MongoDB run with safe true or false?

------
jhugg
Is it listed anywhere which versions of MySQL and MongoDB were run?

~~~
ankrgyl
We used the latest stable versions (as mentioned in the article). It's now
updated to include the exact versions: MongoDB 2.0.6, MySQL 5.5.25, MemSQL 1b

------
heretohelp
The timing seems a little defensive, but I think this raises a good point.

Databases, scaling, persistence, and synchronization are hobbies of mine. As a
result, I generally find the "FSYNC AFTER EVERY DATABASE OPERATION!" crowd
rather tedious.

Even more so when they:

1\. Don't know what they're talking about

2\. Publish benchmarks of databases they don't understand compounded with a
misunderstanding of statistics

With that said, I'm skeptical of how useful MemSQL's specific trade-offs are.
They represent a nice _contrast_ to how most other databases are currently
configured, but I'm not sure what it's supposed to be for.

It _seems_ as if it's designed for situations where you're doing a lot of
table scans. This is rather odd. Are you optimizing for the lazy company that
won't set up an OLAP for analytics and is trying to make their OLTP database
do the work? The vast majority of transactional or just plain 'usual' work is
indexed fetches by id of some row, document, kv blob, etc.

I know OP is actually the CTO of MemSQL and seems to be rather straightforward
so I'd like an answer as to what MemSQL is trying to solve.

At present, it doesn't seem like MemSQL is actually better at the sundry
"fetch by pk/guid" 95% use-case than MySQL. If it is, I'd like some evidence
of that.

Answers that can be summarized as "BETTER THAN MYSQL!" will be met with
derision though ;) Talk to me like an engineer...in trade-offs.

Edit: that "MySQL is a bazillion times faster than MemSQL" post appears to be
the top Google result for MemSQL now. That's gotta sting. N.B. This is a
common mistake database vendors make. Not explaining their niche/use-case
clearly. They love to claim they're superior at everything. Basho is
marginally less guilty of this than most.

~~~
nikita
> It seems as if it's designed for situations where you're doing a lot of
> table scans.

This is not quite true. Lockfree data structures work exceptionally well for a
lot of small concurrent reads and writes. If you look at the benchmarking
code: <https://github.com/memsql/bench/blob/master/sql_database.py> you will
notice that there are a lot of queries like "fetch by pk/guid".

The trade-offs are: taking a hit the first time you run the query, being
limited to the amount of RAM, limited set of features for this release. The
benefits are throughput for "fetch by pk/guid" transactions and predictably
low latencies.
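The "hit the first time you run the query" trade-off is the code-generation compile cost. A hedged sketch (this is not MemSQL's C++ codegen; the cache shape is an assumption): compile once per query *template*, so later executions with different parameters reuse the plan:

```python
class PlanCache:
    """Toy compile-once cache, keyed by query template rather than by
    the fully parameterized query text."""
    def __init__(self):
        self.plans = {}
        self.compiles = 0

    def _compile(self, template):
        self.compiles += 1   # the one-time first-run cost
        # Stand-in for a compiled plan: a callable bound to the template.
        return lambda param: (template, param)

    def run(self, template, param):
        plan = self.plans.get(template)
        if plan is None:
            plan = self.plans[template] = self._compile(template)
        return plan(param)
```

Running the same template with different parameters compiles only once; a new template pays the cost again.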

~~~
continuations
> being limited to the amount of RAM

What will happen if the data grows to larger than the RAM?

~~~
nikita
We stop accepting write queries. We are also working on sharding. RAM gets you
pretty far: $12K buys a terabyte these days.

~~~
moe
Out of curiosity, where do you get a 1TB box for $12K?

Spec'ing those out at Dell I arrive at >40k. I know channel-partner or own
build can trim that significantly, but your price sounds pretty amazing.

