
Re: MemSQL the “world's fastest database”? (2012) - Cieplak
http://www.postgresql.org/message-id/4FE8A2DE.5050806@agliodbs.com
======
jandrewrogers
A well-designed disk-oriented database engine operating in cache will often be
as fast as an in-memory database engine.

I once designed an in-memory database kernel for a supercomputer. One of the
most fun designs I ever did. However, I also learned a lot about the realities
and nuances of in-memory designs that are magnified on a supercomputer but
still have large impacts on modern server hardware.

The biggest insight was this: memory behaves like complex block storage, with
everything implied with respect to optimization, if you really care about
performance. You can easily lose 10x the theoretical throughput by treating
in-memory structures as random access storage.
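
A rough standalone sketch of that effect (my illustration, not from any database kernel): the same array read as a stream versus via dependent pointer-chasing. The chase turns every load into a cache miss, and on typical server hardware it runs an order of magnitude slower than the scan.

    // Sequential scan vs. dependent random access over the same memory.
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const size_t n = size_t{1} << 26;   // 64M slots, ~512 MB
        std::vector<size_t> next(n);

        // Sattolo's algorithm: a random permutation forming one big cycle,
        // so every load depends on the previous one (defeats the prefetcher).
        std::iota(next.begin(), next.end(), size_t{0});
        std::mt19937_64 rng{42};
        for (size_t i = n - 1; i > 0; --i) {
            std::uniform_int_distribution<size_t> pick(0, i - 1);
            std::swap(next[i], next[pick(rng)]);
        }

        using clk = std::chrono::steady_clock;
        auto ms = [](clk::time_point a, clk::time_point b) {
            return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
        };

        // Scan: streams whole cache lines, prefetcher-friendly.
        auto t0 = clk::now();
        size_t sum = 0;
        for (size_t k = 0; k < n; ++k) sum += next[k];
        auto t1 = clk::now();

        // Chase: one cache (and often TLB) miss per hop.
        size_t i = 0;
        for (size_t k = 0; k < n; ++k) i = next[i];
        auto t2 = clk::now();

        std::printf("scan: %lld ms, chase: %lld ms (sum=%zu, i=%zu)\n",
                    (long long)ms(t0, t1), (long long)ms(t1, t2), sum, i);
    }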

In disk-backed systems, we tend to ignore this aspect of memory because,
relative to disks, memory really is uniformly random-access to a first
approximation. But disk-based systems are inherently optimized around multi-
tier block access patterns. In a database kernel without disks, treating
memory access appropriately (i.e. not random access) becomes a major point of
optimization with huge gains if you do it right. And even in systems with
highly optimized disk I/O subsystems (i.e. no POSIX I/O calls or kernel
caching), treating memory as random access starts to put a serious drag on
overall system performance.

In-memory databases allow you to avoid dealing with disk I/O, which gives a
major boost in performance. However, if performance matters, that does not let
you treat memory as a random-access storage device. Understanding how to
design algorithms and data structures that optimize memory I/O patterns often
yields as much performance uplift over naive random-access assumptions as
going from disk I/O to in-memory does. Most in-memory systems never take the
second step.

~~~
jbert
> The biggest insight was this: memory behaves like complex block storage,
> with everything implied with respect to optimization, if you really care
> about performance.

Could you describe the factors causing that? Is it just the caching hierarchy
or are there other effects?

~~~
jandrewrogers
The biggest effects are varying page sizes as bits move through silicon, and
the latency of memory access, which reflects cache hierarchies but also NUMA,
DIMMs, and other considerations in how memory is connected to the system and
the topology by which memory moves into the local CPU cache. Due to the
physical size of supercomputers, the latency variance for "flat" cache-
coherent memory access can become so bad that it has pathological effects on
algorithms that work adequately at lower latencies. With how fast and
efficient CPUs are today, this has a huge effect even on a single system
board. For example, in databases it is not uncommon to pin a core to physical
memory that is directly attached to that CPU's own memory channels rather
than to some other CPU's.
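
A minimal Linux sketch of that pinning trick, using libnuma and pthread
affinity (illustrative only; error handling is mostly elided):

    // Pin a thread to a core and allocate from that socket's local memory.
    // Linux only; compile with: g++ -O2 pin.cc -lnuma -lpthread
    #include <numa.h>      // numa_available, numa_node_of_cpu, numa_alloc_onnode
    #include <pthread.h>   // pthread_setaffinity_np
    #include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
    #include <cstdio>
    #include <cstring>

    int main() {
        if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }

        const int cpu  = 0;
        const int node = numa_node_of_cpu(cpu);  // NUMA node owning this core

        // Pin the calling thread so it cannot migrate away from the
        // memory we are about to allocate.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        // Allocate from the DIMMs on this core's own memory channels,
        // avoiding a trip across the inter-socket interconnect per miss.
        const size_t bytes = size_t{1} << 30;
        void *buf = numa_alloc_onnode(bytes, node);
        if (!buf) { std::puts("allocation failed"); return 1; }

        std::memset(buf, 0, bytes);  // touch pages so placement actually happens
        numa_free(buf, bytes);
    }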

Some performance-sensitive open source libraries do varying amounts of
memory-topology-aware optimization. The rudimentary tricks alone can often be
worth a 2x performance boost for in-memory processing.

In the same way that the really clever schedulers and disk I/O systems in
good databases (they don't let the OS do it) try to optimize average access
time and variance, high-performance in-memory systems use similar techniques
to optimize the average access time to any particular piece of memory. The
techniques are very similar in the abstract, but there is a lot less written
about the nuances of in-memory scheduling; it often involves understanding
lower-level details of the silicon than programmers are used to studying. The
net effect is a large reduction in busy stalls and contention.

Also, people tend to forget that even though lock-free algorithms are fast on
average, they tend to have a lot of overhead in systems with many cores unless
they are used sparingly and accessed relatively infrequently. Eliminating that
waste can have large performance benefits.
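
As a toy illustration of that overhead (generic, not from any particular
system): a single shared atomic counter versus per-core sharded counters. On
a many-core box the contended version is often several times slower, purely
from cache-line ping-pong.

    // Contended atomic counter vs. cache-line-sharded counters.
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    constexpr int  kThreads = 16;
    constexpr long kIters   = 1'000'000;

    std::atomic<long> global_count{0};               // everyone hits this line
    struct alignas(64) Slot { std::atomic<long> v{0}; };
    Slot shards[kThreads];                           // one cache line per core

    template <class F>
    long run_ms(F f) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> ts;
        for (int t = 0; t < kThreads; ++t) ts.emplace_back(f, t);
        for (auto &th : ts) th.join();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    }

    int main() {
        // Contended: every core fights over one cache line.
        long a = run_ms([](int) {
            for (long i = 0; i < kIters; ++i)
                global_count.fetch_add(1, std::memory_order_relaxed);
        });
        // Sharded: each core increments its own line; readers sum on demand.
        long b = run_ms([](int t) {
            for (long i = 0; i < kIters; ++i)
                shards[t].v.fetch_add(1, std::memory_order_relaxed);
        });
        std::printf("contended: %ld ms, sharded: %ld ms\n", a, b);
    }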

To your question: In order to minimize the need for thread coordination and
optimize memory scheduling, the first thing you need to do is design
algorithms and data structures that approximately match the topology of your
storage, whether memory or disk or both. And that rarely looks like idealized
random access memory. Then you optimize for locality of execution. I've
designed database kernels that had thousands of parallel threads running. It
turns out that the techniques used to eliminate almost all coordination in
that case produce extremely efficient database kernels on a commodity server
with vastly smaller resources.

Understanding the topology of real silicon systems is a large part of
"optimization".

~~~
znep
That makes a lot of sense. Regardless of where your data is, you want to best
map your accesses to the underlying reality of whatever hardware is in use.
And memory has significant locality on any even moderately big system.

I do think this glosses over the difference between the data structures
someone focused on efficiently storing and reading data to and from disk
might use and the data structures someone focused mainly on efficiently
representing data in memory (ignoring durability) might use. Queries are one
part of this, but dealing with updates and indexing can also be quite
important.

I don't know if that difference is fundamental to the design of a database or
just more of a lingering consequence of many databases being designed around
most data being on disk, but it is what I observe looking at a variety of
current database solutions.

Getting back to the main article here, I've been doing some basic testing
against MemSQL today and, "world's fastest" aside, I like a lot of what I see,
other than painfully long query parse/compile times. It does, however, appear
to be true for my queries that most of the performance benefits are due to the
distributed query engine and not due to any fundamental data structure
differences compared to something like postgres or mysql/innodb. But my
queries are very anti-index.

SpaceCurve also sounds interesting; hopefully we can firm up our use cases
enough and get far enough along in a technical evaluation that I can find
time to play with it.

------
ankrgyl
This post is missing a key trade-off that you get with an in-memory row store
(tl;dr: the ability to build lock-free indexes), and, being from 6/25/2012,
it also misses several innovations in MemSQL that have shipped over the past
2+ years.

The main advantage of having a main-memory row store is the ability to
leverage random-access data in memory. This frees you to use random access
aggressively and to build lock-free indexes (atomic instructions are available
only for data that's in memory). With these tradeoffs you can build extremely
concurrent data structures that simply aren't possible with a system that must
manage a buffer cache. See [http://blog.memsql.com/the-story-behind-memsqls-
skiplist-ind...](http://blog.memsql.com/the-story-behind-memsqls-skiplist-
indexes/) for more details.
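
For a flavor of what that looks like, here is the CAS-based insert that
lock-free sorted lists (and the bottom level of a skip list) are built on --
a generic textbook pattern for illustration, not our actual code; deletion
and memory reclamation are omitted.

    // Lock-free insert into a sorted singly-linked list.
    #include <atomic>
    #include <cstdint>

    struct Node {
        int64_t key;
        std::atomic<Node*> next;
        Node(int64_t k, Node* n) : key(k), next(n) {}
    };

    std::atomic<Node*> head{nullptr};

    void insert(int64_t key) {
        Node* node = new Node(key, nullptr);
        for (;;) {
            // Find the insertion point: pred -> first node with key >= ours.
            std::atomic<Node*>* pred = &head;
            Node* curr = pred->load(std::memory_order_acquire);
            while (curr && curr->key < key) {
                pred = &curr->next;
                curr = pred->load(std::memory_order_acquire);
            }
            node->next.store(curr, std::memory_order_relaxed);
            // Publish with one CAS; if a concurrent writer slipped in,
            // retry against the new state. No locks are ever taken.
            if (pred->compare_exchange_weak(curr, node,
                    std::memory_order_release, std::memory_order_relaxed))
                return;
        }
    }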

The article also suggests that MemSQL is not durable in the event of a power
outage. This is simply not the case - we've had durability since the first
ever public release of MemSQL, including both asynchronous and synchronous
modes.

MemSQL now also has distributed transactions and MPP, supports intra- and
inter-data-center replication and semi-structured (JSON) data, and even has a
column store that works as a hybrid between disk and in-memory.

And BTW, I actually agree with this mentality in the context of column stores.
There's no real advantage to having an "in-memory" column store because of the
append-only nature of columnar blobs of data. That's why ours is a hybrid.

~~~
colanderman
As someone who knows nothing about database implementations but a lot about
memory, this makes no sense:

> _atomic instructions are available only for data that's in memory_

_All_ CPU instructions are available only for data that's in memory (or
registers). No (sane) database on Earth operates solely by issuing disk I/O.

That linked blog post doesn't support your claim, either. It makes some points
about "indirections" and "buffer caches", neither of which is relevant to the
applicability of lock-free techniques or atomic instructions.

(Quick proof of both of the above: _all_ virtual memory systems abstract both
(a) disk access (swap) and (b) pointer indirection (TLBs), without restricting
atomic operations.)

What _is_ true is that atomic operations, and memory barriers for that
matter, are not effective across coherency domains. So, if you have two
processes concurrently accessing a memory-mapped block device _on different
computers_, then yes, you lose atomic operations and lock-free data
structures. But, AFAIK, this is not how most databases operate: most run as
one or several processes _within the same coherency domain_ (e.g. the same
computer). The operating system ensures that multiple processes mapping the
same file maintain a coherent view of it (primarily by mapping each disk
block to at most one physical memory location).
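
A concrete sketch of that last point (illustrative; error handling elided):
two processes on the same machine can run real CPU atomics against a
disk-backed, memory-mapped file, and the kernel keeps their views coherent.

    // Atomic increments on a memory-mapped file shared across processes.
    #include <atomic>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        int fd = open("/tmp/counter.db", O_RDWR | O_CREAT, 0644);
        ftruncate(fd, sizeof(std::atomic<long>));

        // MAP_SHARED: every process mapping this file sees one coherent
        // view, because each file block maps to at most one physical page.
        void* p = mmap(nullptr, sizeof(std::atomic<long>),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        auto* counter = static_cast<std::atomic<long>*>(p);

        // A CPU atomic operating on "disk-backed" data. Run several copies
        // of this program concurrently and no increments are lost.
        long prev = counter->fetch_add(1, std::memory_order_relaxed);
        std::printf("previous value: %ld\n", prev);

        munmap(p, sizeof(std::atomic<long>));
        close(fd);
    }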

~~~
ankrgyl
Happy to fill in the gaps here. Building a data structure that is lock-free,
performant at concurrent reads and writes, and fast at scanning data is very
challenging. Skip lists let us implement this in a straightforward way, but
they are aggressive with respect to random access. You can indeed leverage
swap to extend a lock-free skip list onto the disk, but the fact that a skip
list links nodes together with pointers does not play well with swap. Swap
performance for random access is an extremely well-known problem on Linux.

Disk-based databases reduce their dependence on random access by using data
structures like B-trees, which do a much better job (i.e. maintain good scan
and seek performance) as they spill on and off the disk. These data
structures, however, are not feasible to implement lock-free.
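
To make the contrast concrete, a generic sketch (not our implementation) of
why B-trees page well: size a node to one 4 KB disk page and it holds ~255
keys, so a three-level tree covers roughly 16 million keys in three page
reads.

    // A B-tree node sized to exactly one 4 KB disk page.
    #include <cstdint>

    constexpr int kPageSize = 4096;
    // Each slot is an 8-byte key plus an 8-byte child pointer / page id.
    constexpr int kFanout   = (kPageSize - 16) / 16;   // 255 slots

    struct BTreeNode {
        uint16_t count;              // keys currently in use
        uint16_t is_leaf;
        uint8_t  pad[12];            // header padded to 16 bytes
        int64_t  keys[kFanout];
        uint64_t children[kFanout];  // page ids on disk, pointers in memory
    };
    static_assert(sizeof(BTreeNode) <= kPageSize, "node must fit one page");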

~~~
colanderman
So it's _not_ a disk vs. memory thing, it's a B-tree vs. skiplist thing. That
makes a lot more sense.

~~~
latch
I'm pretty sure the two are related, no? A B-tree is a disk-friendly
structure (shallow and wide); skiplists aren't (but they are a concurrency-
friendly structure).

~~~
colanderman
Right. My point is that "you can't use atomic instructions with disk-backed
databases" is oversimplified to the point of being incorrect: it's wholly
feasible for, say, Postgres to implement a skiplist-based index (or some other
lock-free data structure) while still backing its store with the disk, with
the only downside being a performance hit if working memory is not large
enough.

------
lazyloop
That message is from 2012.

~~~
dang
Thanks. Added.

------
zwischenzug
These things are like a con-trick on naive developers.

Do they really think that the brains behind the big iron DBs hadn't considered
this stuff?

Our transactions don't do _any_ fsyncs, with a small risk of losing 4k of data
that's recoverable from application logs in any case. As the post says, you
still need sort algorithms, pages, &c.

------
abjorn
He mentions RethinkDB as an in-memory DB, but... it's not.

~~~
latch
RethinkDB has a long history ... someone better informed can shed some light
on this, but when this was written, it was being specifically designed to be
used with SSDs. Clearly, that's still not "in-memory", but it was probably
reasonable to point out that it was designed with a different set of
constraints.

~~~
mglukhovsky
I'm one of the founders of RethinkDB. At no point did we ever work on an in-
memory database product.

RethinkDB is immediately consistent by default -- in our earlier days we did
spend a lot of time optimizing our serializer for solid-state drives. However,
we never designed it with "the tenet that [the database] was 100% in memory."

~~~
nilsimsa
Curious what kind of optimizations you did for SSDs.

~~~
mglukhovsky
Here are a few links (from previous HN threads and blog posts) that dive into
some of the optimizations we made around SSDs:

* [https://news.ycombinator.com/item?id=4795443](https://news.ycombinator.com/item?id=4795443)

* [http://www.rethinkdb.com/blog/on-trim-ncq-and-write-amplific...](http://www.rethinkdb.com/blog/on-trim-ncq-and-write-amplification/)

------
latch
Over the last couple of years, I've become a fan of in-process databases.
Being able to use whatever data structures best fit your data, and avoiding
the network and serialisation (and subsequent GC) overhead, tends to make for
clean (and very fast) systems. Obviously it has drawbacks.

~~~
CHY872
This is almost always not the best approach. Database software has been
written to make developing software as painless as possible. Transactions
(even at the single-write level), foreign keys, indices, and constraints are
all there to ensure that, as much as possible, your database cannot end up in
a broken state.

Writing a database yourself is immediately reinventing the wheel.

In terms of the advantages you mention:

- Data structures - this is a niche advantage. There are probably some
constraints, but all of the data structures that I've seen map nicely to a
serial form - one that might not is probably a bad smell (though that might
just be because I haven't seen any decent examples).

- Avoiding the network - you can do this already with (say) sqlite, which is
probably ideal for what you're describing.

- Serialization - bad luck - if you want to persist your data you must still
serialize it.

- GC - with a modern runtime that can do generational garbage collection this
is as good as free.

- Clean - usually at the expense of correctness. If you have enough data and
time, you'll start seeing interesting bugs. I agree that writing raw SQL is
frustrating, but there are usually libraries that handle standard tasks,
whether that be configuration or ORM.

- Fast - you sound like you're at a data scale where the difference in speed
doesn't matter - and even if you did have data on the order of hundreds of
millions of rows, you'd likely only see a very small perf difference (assuming
sane batching etc).

Writing your own database is probably fine if no existing database meets your
needs (whether that be price, performance, storage capacity, etc) but this is
rarely the case for all but the most demanding applications.

~~~
latch
I think you make good points. I wanted to clarify how I used in-process DBs
because it clearly wasn't obvious from my post.

I still rely on a "normal" database as the authoritative source of all data,
and all writes go to this DB. But changes to the data are queued; workers pick
up these changes and update the in-process DB, which handles 90% of read
requests (and reads, for most apps, are probably > 90% of traffic).

In some ways, it's a cache. But, it doesn't do misses (it's assumed all the
data is there), it's kept relatively consistent (you can normally propagate
changes in < 2s), and you have access to much more than just a GET.
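
The shape of it, as a toy sketch (illustrative names and structures, not a
real library): a worker drains a change queue into an in-process map, and the
hot read path never leaves the process.

    // In-process read store fed by a change queue from the authoritative DB.
    #include <condition_variable>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <string>
    #include <unordered_map>

    struct Change { long id; std::string value; };

    std::queue<Change> changes;    // fed by whatever tails the main DB
    std::mutex q_mu;
    std::condition_variable q_cv;

    std::unordered_map<long, std::string> store;   // the in-process "DB"
    std::mutex s_mu;

    void apply_changes() {         // worker thread
        for (;;) {
            std::unique_lock lk(q_mu);
            q_cv.wait(lk, [] { return !changes.empty(); });
            Change c = std::move(changes.front());
            changes.pop();
            lk.unlock();
            std::lock_guard g(s_mu);
            store[c.id] = std::move(c.value);  // readers see it within seconds
        }
    }

    std::optional<std::string> read(long id) {  // hot path: no network, no parse
        std::lock_guard g(s_mu);
        auto it = store.find(id);
        if (it == store.end()) return std::nullopt;
        return it->second;
    }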

Also, about performance... this lets us handle read requests with <1ms
response times vs, say, 50ms (or 200ms if we're talking about most ORMs).
That can have a huge impact on (a) usability (especially for things sensitive
to latency, like autocomplete) and (b) cost (the difference between $200/month
and $10K/month). You also don't need to put ALL your data in this, just the
most frequently accessed.

Hope that clarifies a little.

------
zak_mc_kracken
Reminds me of Prevayler [1], a database that had similar insane claims
("twenty thousand times faster than a regular database") and which carefully
omitted to say that it was an in-memory database.

[1] [http://prevayler.org/](http://prevayler.org/)

~~~
nl
Prevayler has _always_ made the point that it is in-memory. Notably, the
front page says: _It is an implementation of the Prevalent System design
pattern, in which business objects are kept live in memory and transactions
are journaled for system recovery_.

It's been around for quite a while. Nearly 15 years ago Ward Cunningham (yes,
that Ward Cunningham) was interested in the system and wrote a benchmark for
it - I was one of the people who contributed data[1].

I never thought it was a great idea then, but back in those days computers
were usually RAM-starved. Now it _might_ make a little more sense.

[1]
[http://c2.com/doc/SerializationThroughput/](http://c2.com/doc/SerializationThroughput/)

------
jey
What's up with the random plug for "VoltDB" apropos of nothing?

~~~
latch
I think he's just giving credit where credit is due. I read it as "I'm making
a generalisation about in-memory databases, but VoltDB, which people would
associate as an in-memory database, stands out as having other worthwhile
innovations."

------
jhugg
It's a bit unfair to pick on a 2-year-old post, but I don't agree with what
Josh is saying here, beyond the fact that anyone claiming to have the world's
fastest database is talking about a particular benchmark that makes them look
good.

So there are quite a few things you can do to go faster once you've decided
your data is going to fit in memory. An easy one is to build memory-centric
indexes that never hit disk. These can be several times faster than the disk-
page-sized B-trees that most disk-backed systems use, even if the disk-backed
systems are sitting on a RAM drive.

A second thing you can sometimes get away with is more granular locks on data
structures. Since you don't have to worry about holding a mutex during a page
fault, you don't need super-complex concurrent data structures. It might be
faster just to shove a queue in front of an index and let each index lookup
happen sequentially. This is one of the biggest VoltDB tricks (though I'm
simplifying). It's also easier to build something like the MemSQL skip list
in memory, but I've never seen it outperform the simpler VoltDB approach at
either non-column-based scanning or index lookups. Lock-free data structures
may not block, but they're not simple, and they make other tradeoffs that
affect performance.
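
A toy sketch of that queue trick (my simplification, not VoltDB's code): one
thread owns the index outright, so lookups themselves need no concurrent data
structure at all; callers enqueue requests and wait on a future. A real
system would use a partitioned or lock-free queue rather than this
mutex-guarded one.

    // Single-owner index behind a request queue: no locks on the index itself.
    #include <condition_variable>
    #include <functional>
    #include <future>
    #include <map>
    #include <memory>
    #include <mutex>
    #include <optional>
    #include <queue>

    std::map<long, long> index_;   // touched ONLY by the index thread
    std::queue<std::function<void()>> requests;
    std::mutex mu;
    std::condition_variable cv;

    void index_thread() {          // the sole owner of index_
        for (;;) {
            std::unique_lock lk(mu);
            cv.wait(lk, [] { return !requests.empty(); });
            auto job = std::move(requests.front());
            requests.pop();
            lk.unlock();
            job();                 // runs strictly sequentially: no races
        }
    }

    std::future<std::optional<long>> lookup(long key) {
        auto p = std::make_shared<std::promise<std::optional<long>>>();
        auto f = p->get_future();
        {
            std::lock_guard g(mu);
            requests.push([p, key] {
                auto it = index_.find(key);
                if (it == index_.end()) p->set_value(std::nullopt);
                else                    p->set_value(it->second);
            });
        }
        cv.notify_one();
        return f;
    }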

As far as scanning tuples goes, which the recent MemSQL marketing calls out, I
don't think that has much to do with disk or memory, but rather with
architecture and implementation. Vertica's big innovation was using a column
store to scan faster, and this seems to be what MemSQL is doing here. If you
want to do full table scans, this is very smart, although you pay a loading
penalty as a tradeoff for query performance. Vertica's huge win was technology
to make that loading penalty smaller; I'm not sure what the costs to load
MemSQL's column data are. I suspect one key innovation is being cheaper than
Vertica.

Another legit innovation of VoltDB was to give up external transaction control
along with disk. The only thing worse than keeping a transaction open while
waiting for disk is keeping it open while waiting for a user. Getting rid of
both allows for a crazy fast system when you can assume no substantial waits
in your fast path. Using batched SQL or stored procs, you can keep the
conditional logic you love, but go fast. Many NoSQL systems (and MemSQL) get
this same speedup by just dropping transactions altogether. MVCC does a lot to
alleviate this problem, but adds its own complexity; it's really hard to get
right/fast over a network.

Hekaton is another system that has made really different choices when focused
on memory and reaped benefits. I think for business reasons, it's really tied
to MS SQL Server's disk implementation. Though that has pros to go with cons.
Hekaton is also new since Josh wrote this post.

And all of these systems can be fully or mostly persistent on disk. For
design purposes, it matters more that your data _fits_ in memory than that it
never writes to disk. VoltDB in particular can perform many millions of
synchronous writes per second, with SSDs or even with spindles behind a
flash-backed disk controller.

Disclaimer: VoltDB Eng.

