

Databases at 14.4Mhz - chton
http://blog.foundationdb.com/databases-at-14.4mhz

======
ChuckMcM
This is FoundationDB's announcement that they are doing full ACID databases
with a capability of 14.4M writes per second. That is insanely fast in the
database world. It runs in AWS on 32 c3.8xlarge machines -- so basically
NSA-level database capability for $150/hr. But perhaps more interesting is
that those same machines on the open market are about $225,000. That's two
racks, one switch, and a transaction rate that lets you watch every purchase
made at every Walmart store in the US, in real time. That is assuming the
stats are correct[1], and it wouldn't even be sweating (14M customers a day
vs. 14.4M transactions per second). Insanely fast.
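
A quick back-of-envelope in Python, using the same assumed stats:

    # Walmart's average purchase rate vs. the benchmark's write rate
    # (14M customers/day is the figure from [1]; treat it as an assumption).
    CUSTOMERS_PER_DAY = 14_000_000
    SECONDS_PER_DAY = 24 * 60 * 60            # 86,400

    purchases_per_sec = CUSTOMERS_PER_DAY / SECONDS_PER_DAY
    print(f"~{purchases_per_sec:,.0f} purchases/sec")     # ~162/sec

    CLUSTER_WRITES_PER_SEC = 14_400_000
    print(f"~{CLUSTER_WRITES_PER_SEC / purchases_per_sec:,.0f}x headroom")  # ~89,000x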

I wish I was an investor in them.

[1] [http://www.statisticbrain.com/wal-mart-company-
statistics/](http://www.statisticbrain.com/wal-mart-company-statistics/)

~~~
vilda
Not all ACID transactions are equal. This is just a key-value-store-like test.
It shows the potential to scale, yet nothing about performance in the real
world.

32 c3.8xlarge instances have 1,920 GB of memory. Given 1 billion 16 B keys
with 8-100 B values, the whole dataset fits into memory.
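
Rough sizing, assuming worst-case 100 B values and 60 GB of RAM per
c3.8xlarge:

    N_KEYS = 1_000_000_000
    KEY_B, MAX_VALUE_B = 16, 100

    raw_gb = N_KEYS * (KEY_B + MAX_VALUE_B) / 1e9
    print(f"worst-case raw data: ~{raw_gb:.0f} GB")    # ~116 GB

    cluster_ram_gb = 32 * 60
    print(f"cluster RAM: {cluster_ram_gb} GB")         # 1,920 GB
    # Even doubled for replication, the dataset sits comfortably in memory.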

The Cassandra test mentioned [1] sustained the loss of 1/3 of its instances.
That's very impressive! I would love to see how FoundationDB handles this type
of real-life situation (hint hint for a follow-up blog post).

[1] [http://googlecloudplatform.blogspot.cz/2014/03/cassandra-
hit...](http://googlecloudplatform.blogspot.cz/2014/03/cassandra-hits-one-
million-writes-per-second-on-google-compute-engine.html)

~~~
jbergens
They do seem to talk about writes per second. That means they must flush to
disk, not serve from cache.

[Edit, more info] They also run with multiple clients, which further stresses
the system. From their explanation: "The clients simulate 320,000 concurrent
sessions."

------
hendzen
This is very impressive, however...

See this tweet by @aphyr:
[https://twitter.com/aphyr/status/542755074380791809](https://twitter.com/aphyr/status/542755074380791809)

(All credit for the idea in this comment is due to @aphyr)

Basically because the transactions modified keys selected from a uniform
distribution, the probability of contention was extremely low. AKA this
workload is basically a data-parallel problem, somewhat lessening the
impressiveness of the high throughput. Would be interesting to see it with a
Zipfian distribution (or even better, a Biebermark [0])
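
A rough sketch of the difference (the Zipf parameter is my assumption, just to
show the shape of the problem):

    import numpy as np

    KEYSPACE, TXN_KEYS = 10**9, 20

    # Uniform keys: birthday-style bound on two write sets overlapping.
    p_uniform = 1 - (1 - TXN_KEYS / KEYSPACE) ** TXN_KEYS
    print(f"uniform overlap prob: ~{p_uniform:.1e}")      # ~4e-7

    # Zipfian keys: the hot head of the distribution collides constantly.
    rng = np.random.default_rng(0)
    trials, hits = 10_000, 0
    for _ in range(trials):
        a = set(rng.zipf(1.2, TXN_KEYS))
        b = set(rng.zipf(1.2, TXN_KEYS))
        hits += bool(a & b)
    print(f"zipf(1.2) overlap prob: ~{hits / trials:.2f}")  # close to 1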

[0] -
[http://smalldatum.blogspot.co.il/2014/04/biebermarks.html](http://smalldatum.blogspot.co.il/2014/04/biebermarks.html)

~~~
Dave_Rosenthal
Actually, @aphyr's analysis is wrong. In fact, it's even worse than he says:
there is exactly a 0% chance of conflict between two transactions that each
only write 20 random keys (as they do in this test). This is perhaps
counterintuitive, but because there are no reads, any serialization order is
possible, and therefore there is no chance of conflict.

Equally wrong is his assumption that this has any bearing on the performance
of FoundationDB. In fact, FoundationDB will perform the same whether the
conflict rate is high or low. This isn't to say that FoundationDB somehow
cheats the laws of transaction conflicts, just that it has to do all the work
in either case. There is no trick or cheat in this test; the same performance
will hold on a variety of workloads with varying conflict rates, as well as
on those including reads.
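
A minimal sketch of why blind writes can't conflict, assuming the usual
optimistic read/write conflict detection (my illustration, not FoundationDB's
actual code):

    # A transaction is rejected only if a key in its READ set was written by
    # a transaction that committed after the version it read at.
    def conflicts(read_set, read_version, commit_log):
        # commit_log: list of (commit_version, written_keys) pairs
        return any(ver > read_version and read_set & written
                   for ver, written in commit_log)

    # 20 blind writes: the read set is empty, so nothing can invalidate it.
    print(conflicts(set(), 100, [(101, {"k3", "k7"}), (102, {"k1"})]))  # False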

~~~
fry_the_guy
Your definition of performance is a little weird.

Presumably you will want to retry conflicting transactions, so you generally
would not count them towards your throughput.

For example if I commit 100 transactions per second, and 90% of them return
conflicts, I am only successfully committing 10 transactions per second.

~~~
Dave_Rosenthal
That's true: a large conflict rate would eat into your budget because clients
would be retrying.

Of course, most real-world workloads are (hopefully!) nowhere near 90%
conflicts. If you had a 90% transaction conflict rate, you could expect
FoundationDB performance to drop by about 30-60% due to retries (and, worse
news, you would need to rethink your architecture).
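
A toy model of where a 30-60% figure could come from (the abort-cost fraction
is my assumption, not official FoundationDB math):

    # With conflict probability p, a commit takes 1/(1-p) attempts on average.
    # If a conflicted attempt costs only a fraction f of a committed one
    # (aborts skip most of the write pipeline), relative goodput is:
    def goodput(p, f):
        retries = p / (1 - p)              # expected failed attempts per commit
        return 1 / (1 + retries * f)

    for f in (0.05, 0.10, 0.15):
        print(f"p=0.9, abort cost {f:.0%}: {1 - goodput(0.9, f):.0%} drop")
    # ~31%, ~47%, ~57% -- in the quoted 30-60% range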

------
jrallison
We've been using FoundationDB in production for about 10 months now. It's
really been a game changer for us.

We continue to use it for more and more data access patterns which require
strong consistency guarantees.

We currently store ~2 terabytes of data in a 12 node FDB cluster. It's rock
solid and comes out of the box with great tooling.

Excited about this release! My only regret is I didn't find it sooner :)

~~~
lclarkmichalek
Why do you need a 12 node cluster for 2 TB of data?

~~~
jrallison
With a replication factor of 2 (for fault tolerance), it's ~4.5 TB.

FoundationDB requires SSDs, and the best we can get efficiently in our data
center is ~670 GB of usable space per node (3x480 GB, RAIDed).

12 x 670 = 8,040 GB

We try to keep extra space available for node failures (FDB will immediately
start replicating additional data if it notices data with fewer than the
configured number of replicas).

Our dataset currently grows at a decent pace, so we over provision a bit as
well.
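
The capacity math, spelled out:

    DATA_ON_DISK_GB = 4500        # ~2 TB logical at RF 2, per the above
    NODES, USABLE_PER_NODE_GB = 12, 670

    total_gb = NODES * USABLE_PER_NODE_GB                     # 8,040 GB
    print(f"utilization: {DATA_ON_DISK_GB / total_gb:.0%}")   # ~56%
    # The remaining ~44% absorbs re-replication after a node failure
    # plus dataset growth.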

~~~
jedberg
Why do you use RAID on your nodes if you have an RF==2?

~~~
jrallison
A few reasons off the top of my head:

1) We're still interested in the nodes being as reliable as they can be. With
RAID 5, we need two simultaneous disk failures to brick a node. With RAID 0
(to increase usable disk space), any of the 3 disks can brick the node.

Even with 12 nodes and an RF of 2, an order-of-magnitude increase in node
failures would make a service disruption more likely. Perhaps this trade-off
makes more sense in a larger cluster with a higher RF?

2) We're using commodity hardware from a dedicated server provider, which uses
RAID 1 or 5 configurations exclusively by default. We haven't reached the
point where we felt that was enough of an issue or gain to investigate
changing it.

3) Having more CPUs/disks in the cluster (12 nodes rather than cramming all
data on 6) can be a good thing... as FDB scales fairly linearly.

~~~
jedberg
I think you misinterpreted what I said; I explained it more clearly below. I
suggested using the same 12 nodes but configuring each one as RAID 0, which
would get you more reliability and more storage for the same cost. In your
current config, two dead disks can brick the system -- in the config I
propose, you'd need four dead disks before anyone noticed.

What I'm suggesting is that you think of the cluster more holistically, since
I assume your goal is a reliable cluster, not reliable nodes. As a nice bonus
you get more "free" disk space.

~~~
jrallison
That was my interpretation of your comment, but I'm still not sure I follow.
In my understanding, with RAID 0 _any_ single disk failure will brick a node.
Each node would then have 3 disks that are ticking time bombs (multiplying the
node failure rate by 3). How is that more reliable?

With RAID 5, I can have 1 disk failure on a node with no problem. With 2 disk
failures on the same node, I only lose 1 node of my 12-node cluster (i.e. the
cluster is fine). I can also theoretically lose 12 disks (1 on each node) plus
2*(RF-1) more, and gracefully repair the situation with zero interruption.

What's the benefit of RAID 0 other than increased usable disk space and
perhaps write performance? It seems you're decreasing reliability
significantly for those gains.
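
A rough model of the per-node trade-off (the 3% annual disk failure rate is an
assumed, illustrative number, and ignoring rebuild windows flatters RAID 5):

    AFR, DISKS = 0.03, 3      # assumed per-disk annual failure rate

    # RAID 0: any disk failure kills the node.
    p_raid0 = 1 - (1 - AFR) ** DISKS
    # RAID 5 (3 disks): the node dies only if 2 or more disks fail.
    p_raid5 = 3 * AFR**2 * (1 - AFR) + AFR**3

    print(f"RAID 0 node loss/yr: {p_raid0:.1%}")    # ~8.7%
    print(f"RAID 5 node loss/yr: {p_raid5:.2%}")    # ~0.26%
    # Whether ~30x more node losses matters then depends on how well the
    # cluster-level replication (RF) absorbs them -- which is the crux of
    # the disagreement above.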

------
bsaul
Just watched the linked presentation about "Flow" here:
[https://foundationdb.com/videos/testing-distributed-
systems-...](https://foundationdb.com/videos/testing-distributed-systems-with-
deterministic-simulation)

Is it really the first distributed DB project to have built a simulator?

Because frankly, if that's the case, it seems revolutionary to me.
Intuitively, it seems like it brings the same kind of quality improvement that
unit testing brought to regular software development.

PS: I should add that this talk is one of the best I've seen this year. The
guy is extremely smart, passionate, and clear. (I just loved the Hurst
exponent part.)

~~~
jchrisa
At Couchbase we have a plethora of simulators: simulations of cluster topology
changes, simulations of failure scenarios, simulations of workloads to
estimate required cluster sizes, etc. Here's one:
[https://github.com/couchbaselabs/cbfg](https://github.com/couchbaselabs/cbfg)

~~~
Dave_Rosenthal
I think the difference is that FoundationDB has only one, and it's not an
external simulation: the actual code that runs in production can also
deterministically simulate a cluster of itself.

I do believe this is unique among publicly available distributed databases.

~~~
anonymousDan
Why is it a good thing that it's not an external simulation?

~~~
itp
I'll give you some brief thoughts from my experience working at FoundationDB,
but if you really want the in-depth answer to what makes our simulation
framework an enabling technology for everything we do, you should take a look
at the talk my colleague Will Wilson gave at Strange Loop [0].

Everything in our core is written in Flow, our asynchronous programming
environment. The architecture of our server is essentially single threaded,
with (pseudo-)concurrent actors allowing a single process to satisfy multiple
roles at once.

The interfaces that these actors use to communicate -- with the disk, over the
network, with the OS in general -- are abstracted and implemented in Flow as
well. Each of these interfaces has at least one "real" implementation, and at
least one (sometimes more) simulated implementation.

Since we run multiple actors in a single process all the time, running yet
more actors, pretending to be different machines, all still in the same
process, was an obvious step. These workers communicate with each other via
the _same_ network interfaces that real-world workers do in real clusters.

In the end we are able to become our own adversaries, pushing the limits of
the system in ways that just don't happen often enough in the real world to
test and debug in the wild. Our simulated implementations are allowed to
present any behavior that can be observed in the real world, or justified by
the spec, or implied by the contract. And they can do so much more frequently.

But most importantly -- as in, without this our system would never have been
developed -- we can simulate these pathological behaviors in a completely
deterministic fashion. Having the power to run a million tests, simulating
multi-machine clusters trying to complete workloads while suffering from the
most unbelievable combination of network partitions, dropped packets, failed
disks, etc. etc., all while knowing that any error can be replayed, stepped
through in a debugger, in a single process on a single machine... without this
capability, we would never have had the confidence to build our product or
evolve it as aggressively as we have.

tl;dr: We're not making simulations _of_ the system, we're running the system
_in_ simulation, and that makes all the difference.
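
To make that concrete, here is a minimal sketch of the pattern in Python (my
illustration of the idea, not Flow or FoundationDB code):

    import random

    class Network:                       # the abstracted interface
        def send(self, dest, msg): ...

    class RealNetwork(Network):
        def send(self, dest, msg):
            pass                         # would write to an actual socket

    class SimNetwork(Network):
        def __init__(self, seed):
            self.rng = random.Random(seed)   # ALL nondeterminism flows from here
            self.delivered = []

        def send(self, dest, msg):
            if self.rng.random() < 0.05:
                return                       # inject a dropped packet
            delay = self.rng.expovariate(1000)   # inject variable latency
            self.delivered.append((delay, dest, msg))

    # Same seed => identical run, so any failing test replays exactly,
    # in one process, under a debugger.
    a, b = SimNetwork(42), SimNetwork(42)
    for net in (a, b):
        for i in range(100):
            net.send("worker-1", i)
    assert a.delivered == b.delivered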

[0]:
[https://www.youtube.com/watch?v=4fFDFbi3toc](https://www.youtube.com/watch?v=4fFDFbi3toc)

------
dchichkov
I'm only familiar with other key-value storage engines, not FoundationDB, but
it seems like the goals are: "distributed key-value database, read latencies
below 500 microseconds, ACID, scalability".

I remember evaluating a few low-latency key-value storage solutions, and one
of them was Stanford's RAMCloud, which is supposed to give 4-5 microsecond
reads and 15 microsecond writes, scale up to 10,000 boxes, and provide data
durability.
[https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud](https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud)
Seems like that would be "Databases at _2000Mhz_".

I've actually studied the code that handles the network, and it is written
pretty nicely; as far as I know, it should work over both 10GbE and InfiniBand
with similar latencies. And I'm not at all surprised they could get a pretty
clean-looking 4-5us latency distribution with code like that.

How does it compare with FoundationDB? Is it completely different technology?

~~~
jrallison
Don't know much about RAMCloud, but from the description:

"Many more issues remain, such as whether we can provide higher-level features
such as secondary indexes and multiple-object transactions without sacrificing
the latency or scalability of the system. We are currently exploring several
of these issues."

Sounds like it doesn't provide multi-key ACID transactions, at the very least.

~~~
dchichkov
Yes, it looks like it doesn't support multi-key ACID transactions. As per:

[https://ramcloud.atlassian.net/wiki/display/RAM/Deciding+Whe...](https://ramcloud.atlassian.net/wiki/display/RAM/Deciding+Whether+to+Use+RAMCloud)

> If your application requires the higher-level data model features of a
> relational database, such as secondary indexes and transactions spanning
> multiple objects, then it may be difficult to get that application running
> on RAMCloud. The current RAMCloud data model is a key-value store with a few
> extensions (such as conditional writes, table enumeration, and multi-
> read/multi-write operations); it does not yet support secondary indexes and
> multi-object transactions. Most of these operations can be implemented on
> top of the existing RAMCloud features, but it will take extra work and these
> operations will not run as fast as the built-in RAMCloud operations. We are
> currently working on adding higher-level features to RAMCloud.

Still. Aside from that, it is free and open source, and gives _a hundred
times_ better latency?

~~~
misframer
> Still. Aside from that, it is free and open source, and gives a hundred
> times better latency?

It doesn't have the same guarantees. If you need those guarantees, then
there's no comparison and it doesn't make sense to compare latencies.

~~~
dchichkov
I'm not so sure. Have the latencies claimed by FoundationDB been measured on
multi-key transactions?

And if you need those guarantees, it seems like "Most of these operations can
be implemented on top of the existing RAMCloud features". I would expect it to
be possible to implement that and stay well within the latency budget,
provided that a single-key durable write is 100 times faster.
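
For the flavor of that "extra work", here's a hedged sketch of a lock-based
two-phase commit built on a conditional write. `conditional_write` is a
hypothetical stand-in for RAMCloud's conditional-write primitive, not its real
API:

    store = {}   # key -> (version, value, lock_owner)

    def conditional_write(key, expected_version, value, lock_owner=None):
        version, _, lock = store.get(key, (0, None, None))
        if version != expected_version or lock is not None:
            return False              # version moved or key locked: retry
        store[key] = (version + 1, value, lock_owner)
        return True

    def commit_multi(txn_id, writes):    # writes: {key: (expected_ver, value)}
        undo = {}
        for key, (ver, val) in writes.items():   # phase 1: CAS-lock every key
            prev = store.get(key, (0, None, None))
            if not conditional_write(key, ver, val, lock_owner=txn_id):
                for k, old in undo.items():      # abort: restore old state
                    store[k] = old
                return False
            undo[key] = prev
        for key in writes:                       # phase 2: unlock = commit
            ver, val, _ = store[key]
            store[key] = (ver, val, None)
        return True

    print(commit_multi("t1", {"a": (0, 1), "b": (0, 2)}))   # True
    print(commit_multi("t2", {"a": (0, 9)}))   # False: "a"'s version moved
    # A real version also needs readers to handle locked keys, plus lock
    # timeouts for crashed writers -- hence the extra round trips and latency.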

~~~
nlavezzo
All of the transactions in the test noted in the blog are multi-key
transactions (20 key updates each).

~~~
dchichkov
When claiming the 14.4MHz number, is that per 20-key update, or per single key
within a 20-key update? And what's the latency of a 20-key update?

~~~
itp
The writes we are measuring are individual keys, modified atomically in
transactions of 20 keys at a time. So this test is doing 14 million writes per
second as 720 thousand transactions per second.
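
Through the Python bindings, one benchmark-shaped transaction would look
roughly like this (the key/value sizes mirror the post; the code is my sketch,
not the actual load generator):

    import os
    import random
    import fdb

    fdb.api_version(300)      # FoundationDB 3.0-era API
    db = fdb.open()           # uses the default cluster file

    @fdb.transactional
    def write_batch(tr):
        # 20 random 16-byte keys with 8-100 byte values, committed atomically.
        for _ in range(20):
            tr[os.urandom(16)] = os.urandom(random.randint(8, 100))

    write_batch(db)   # 720k of these per second = 14.4M writes/sec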

------
felixgallo
This looks very interesting and congratulations to the FoundationDB crew on
some pretty amazing performance numbers.

One of the links leads to an interesting C++ actor preprocessor called 'Flow'.
It includes a table listing the time to send a message around a ring for a
given number of processes (N) and messages (M), in which Flow appears to be
fastest at 0.075 sec for N=1000 and M=1000, compared with, e.g., Erlang at
1.09 seconds.

My curiosity was piqued, so I threw together a quick microbenchmark in Erlang.
On a moderately loaded 2013 MacBook Air (2-core i7) with Erlang 17.1, 1000
iterations of M=1000 and N=1000 averaged 34 microseconds per run, which
compares pretty favorably with Flow's claimed 75,000 microseconds. The Flow
paper appears to be from around 2010, so it would be interesting to know how
it's doing in 2014.
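
For anyone curious about the shape of that benchmark, here is the ring in
Python (threads stand in for Erlang processes/Flow actors, which are far
lighter, so only the structure carries over, not the numbers; the token is
serialized through the ring):

    import queue
    import threading
    import time

    def ring(n_procs, n_msgs):
        queues = [queue.Queue() for _ in range(n_procs)]

        def worker(i):
            nxt = queues[(i + 1) % n_procs]
            while True:
                msg = queues[i].get()
                if msg is None:        # shutdown token: forward it and exit
                    nxt.put(None)
                    return
                if msg == 0:           # hop budget spent: start shutdown wave
                    nxt.put(None)
                    return
                nxt.put(msg - 1)       # pass the decremented token along

        threads = [threading.Thread(target=worker, args=(i,))
                   for i in range(n_procs)]
        for t in threads:
            t.start()
        start = time.perf_counter()
        queues[0].put(n_procs * n_msgs)    # M messages around a ring of N
        for t in threads:
            t.join()
        return time.perf_counter() - start

    print(f"{ring(100, 100):.3f} sec")   # keep N modest: OS threads are heavy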

~~~
emn13
There's got to be some mistake there somewhere -- on your part, or on theirs --
because there's no way Erlang improved from 1.09 seconds to 34 _micro_ seconds
on pretty much any benchmark between 2010 and 2014. Even the factor of 1000
(the message count) isn't enough to account for that difference -- something's
fishy.

~~~
Dave_Rosenthal
1M message passes in 34 microseconds would be ~0.1 CPU clocks per message
pass. Maybe an optimizer is stepping in and short-circuiting?

~~~
felixgallo
Heh, not only that, but my fingers accidentally made my test massively
concurrent! Such are the perils of Erlang. When I defeat the optimizer and
also force message passing to be serialized both inside and outside a trial, I
get an average of 525,000 microseconds, which is more in line with the other
results.

------
shortstuffsushi
As someone who has no idea about the cost of high-scale computing like this:
is $150/hr reasonable? It seems like an amount that would be hard to sustain,
but I have no idea if that's a steady, all-the-time rate, a burst rate, or
what. Or whether it's a setup you'd actually ever need -- from the examples
they mention (like the tweets), it seems they're above the need by a fair
amount. Anyone else in this sort of situation care to chip in on that?

~~~
landryraccoon
Per the article, it's about 1/20th the cost that Google would charge you for
the same number of writes.

~~~
shortstuffsushi
I had seen that as well, actually. Is this the kind of data or transaction
level that might only be expected at Google-scale throughput?

~~~
emn13
Well, do you do 14.4 million serializable writes per second or anything close
to that?

~~~
shortstuffsushi
Note my original comment -- nope. I haven't worked on anything that would have
a need anywhere near that amount of throughput.

------
w8rbt
I thought it was DB connections over radio waves just above 20 meters. Also,
it's MHz, not Mhz.

------
maliki
This sounds great compared to my anecdotal experience with DB write
performance, but is there a collection of database performance benchmarks that
this can easily be compared to?

The best source for DB benchmarking I know of is
[http://www.tpc.org/](http://www.tpc.org/). The methodology is more
complicated there, but the top results are around 8 million transactions per
minute on $5 million systems. This FoundationDB result is more like 900
million transactions per minute on a system that costs $1.5 million a year to
rent (so, approx $5 million to buy?).
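
Normalizing the units (using the figures above):

    writes_per_min = 14_400_000 * 60      # 864M writes/minute, i.e. ~900M
    tpcc_tpm = 8_000_000                  # top TPC-C results on ~$5M systems
    print(writes_per_min / tpcc_tpm)      # ~108x
    # Caveat: a TPC-C "transaction" bundles many low-level reads and writes,
    # so the two numbers are not directly comparable.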

The USD/transactions-per-minute metric is clear, but without a standard test
suite (schema, queries, client count, etc.), comparing claims of database
performance makes my head hurt.

~~~
Dave_Rosenthal
I think you mean "900 _million_ transactions per minute". Of course, that
overstates things, since each TPC-C transaction entails a lot more than one
write: TPC-C is about 2/3 reads and 1/3 writes, and each TPC transaction might
do 20 low-level read+write operations (I'm actually not sure, but I think
that's in the ballpark).

In the NoSQL world many people have converged on a workload of 90% reads/10%
writes to individual keys. We show 90/10 results on our performance page [1]
but in this test we do 100% writes to stress the "transaction engine", which
processes writes.

Since we have our SQL Layer [2] as well, we will run some more-comparable SQL
tests in the future.

[1] [https://foundationdb.com/key-value-
store/performance](https://foundationdb.com/key-value-store/performance)

[2] [https://foundationdb.com/layers/sql](https://foundationdb.com/layers/sql)

~~~
maliki
(Ah, yes, thank you: 900 million per minute. I typo'd.)

Nice performance page. It pre-answers my follow-up question, which was how
linearly it scales with more cores. Looks solid.

------
illumen
Impressive.

However I think there's still plenty of room to grow.

320,000 concurrent sessions isn't that much by modern standards. You can get
12 million concurrent connections on one Linux machine and push a gigabit of
data per second.

Also, the write volume (116 B * 14.4 million = ~1.67 gigabytes per second
across the cluster, or ~52 MB/s per machine) is not pushing the limits of what
one machine can do. I've been able to process 680 megabytes per second of data
into a custom video database, plus write it to disk, on one 2010 machine --
while doing heavy processing on the video with plenty of CPU to spare.
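
The bandwidth arithmetic, for reference:

    BYTES_PER_WRITE = 116                 # 16 B key + up to 100 B value
    WRITES_PER_SEC = 14_400_000
    MACHINES = 32

    total_mb = BYTES_PER_WRITE * WRITES_PER_SEC / 1e6
    print(f"cluster: ~{total_mb:,.0f} MB/s")                # ~1,670 MB/s
    print(f"per machine: ~{total_mb / MACHINES:.0f} MB/s")  # ~52 MB/s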

PCIe over fibre can carry a huge number of messages per second. And you can
fit 2TB-memory machines in 1U (and more).

Since this is a memory + eventually dump to disk database, I think there is
still a lot of room to grow.

------
mariusz79
MHz not Mhz.

------
tuyguntn
I wish I could work with such great professionals as a junior for 10 years,
right after school!

------
oconnor663
Any chance of an open source release for the database core? :)

------
lttlrck
"Or, as I like to say, 14.4Mhz."

Sorry, I don't like that at all.

~~~
simcop2387
I would agree. While 14.4 million writes per second is impressive, it
definitely doesn't fit the definition of the Hz unit (cycles per second).

~~~
jrochkind1
"Hertz" just means "per second". It can be anything per second, there are no
units in the numerator. It just a measure of frequency, anything per second.

[http://en.wikipedia.org/wiki/Hertz](http://en.wikipedia.org/wiki/Hertz)

If anything you are counting is happening 14.4 million times a second, then
that thing is happening at 14 megahertz. It's just a unit. If there are 14.4
million writes per second, then the writes are occuring at 14.4 mega hertz.
That's just the definition of hertz.

Haha, Google agrees, check it out:
[https://www.google.com/?q=4+per+second#q=4+per+second](https://www.google.com/?q=4+per+second#q=4+per+second)

~~~
ctdonath
"Hertz" just means " _cycles_ per second".

The reference is to frequency, not average or typical rate; there's a
difference between a tone and pink noise.

~~~
jrochkind1
And in their test, they made writes at that frequency, right?

It is, however, commonly used for average rates too, not just measurements of
constant frequency. "Frequency", the word, means how frequently something
happens, right?

I don't know what the 'something' to measure the frequency of in pink noise
would be, but if there is something to count over time, you can measure its
frequency.

"Frequency is the number of occurrences of a repeating event per unit time,"
says Wikipedia.
[https://en.wikipedia.org/wiki/Frequency](https://en.wikipedia.org/wiki/Frequency)

However, it is true that "In some fields, especially where frequency-domain
analysis is used, the concept of frequency is applied only to sinusoidal
phenomena, since in linear systems more complex periodic and nonperiodic
phenomena are most easily analyzed in terms of sums of sinusoids of different
frequencies." In some fields.

I have no idea why we're arguing about this except that we like arguing on the
internet though.

~~~
ctdonath
_I have no idea why we're arguing about this except that we like arguing on
the internet though._

This.

------
imanaccount247
Why the deliberately misleading comparisons? If you are doing something
genuinely impressive, then you should be able to be honest about it and have
it still seem impressive. One tweet is not one write; comparing tweets per
second to writes per second is complete nonsense, since how many writes a
tweet causes depends on how many followers the person tweeting has. The 100
writes per second nonsense is even worse. Do you just think nobody is old
enough to have used a database 15 years ago? 10,000 writes per second was no
big deal on off-the-shelf hardware of the day, never mind on an actual server.

~~~
jrochkind1
It's just a comparison to give you a sense of what those numbers look like --
an order-of-magnitude comparison. I didn't find it misleading, and didn't
think it meant you'd be able to run Twitter on their DB on one commodity
server. It's an order-of-magnitude estimate to give you a sense of the scale.

