
120K distributed consistent writes per second with Calvin - zenithm
https://fauna.com/blog/120-000-consistent-writes-per-second-with-calvin
======
itp
This seems cool, and I sincerely wish them nothing but success. That said, I
had a major sense of déjà vu while reading this post -- I worked at
FoundationDB prior to the Apple acquisition, when we published a blog post
with a very similar feel:

[http://web.archive.org/web/20150325003241/http://blog.founda...](http://web.archive.org/web/20150325003241/http://blog.foundationdb.com/databases-at-14.4mhz)

I'm not trying to make a comparison between a system I used to work on and one
that I frankly know little to nothing about; rather, I'd suggest that building
a system like this just isn't enough to be compelling on its own.

~~~
halestock
Bah! I was so disappointed when I heard about the Apple acquisition of
FoundationDB. Will any of the technology behind it ever see the light of day?

~~~
itp
Unfortunately I'm the last person to ask. While I did start at FoundationDB
pretty early (second employee), I ceased to be involved at the point of the
acquisition, and beyond that I've only heard a few rumors from former
coworkers.

As a business it was always an ambitious effort, and I'm not sure what could
or should have been done differently. But since then I've used a number of
other systems and thought to myself "boy, I wish I had FDB right now."

~~~
wwilson
Another former-FoundationDB guy here (hi Ian!), and I actually think the
business case for Apple open-sourcing it is very strong. I'm a fan of the layered
architecture we chose, but building efficient and powerful layers on top of
the core key-value store is a serious engineering effort in its own right. By
encouraging an open-source layer ecosystem (and operational and deployment
tools), Apple could leverage its investment in the core technology more
effectively.

Whether Apple's leadership agrees with me is another question. :)

~~~
evanweaver
We specifically chose a monolithic architecture for FaunaDB, since performance
improvements invariably come from breaking interface boundaries and sharing
additional information. It's been working out well.

~~~
wwilson
Yes, this is the argument that VoltDB made as well:

[https://www.voltdb.com/blog/foundationdbs-lesson-fast-key-va...](https://www.voltdb.com/blog/foundationdbs-lesson-fast-key-value-store-not-enough)

My feelings on this topic are mixed. On the one hand, I think many of the
specific examples chosen in that post are false (and have told John as much in
person). On the other hand, the general point that you can squeeze out
constant factor performance improvements by violating abstraction boundaries
is obviously usually true.

Nevertheless, I still think this is a bad argument. While it's true that
abstractions are rarely costless, they can often be made so cheap that the
low-hanging performance fruit is elsewhere. And in particular, cheap enough
that they're worth it when you consider all the other benefits that they
bring.

When I built a query language and optimizer on top of FoundationDB, my
inability to push type information down into the storage engine was about the
last thing on my mind. Perhaps someday when I'd made everything else perfect
it would've become a big pain (and perhaps someday we would've provided more
mechanisms for piercing the various abstractions and providing workload
hints), but in the meantime partitioning off all state in the system into a
dedicated component that handled it extremely well made the combined
distributed systems problem massively more tractable. The increased developer
velocity and reduced bugginess in turn meant that I (as a user of the key-
value store) could spend scarce engineering resources on other sorts of
performance improvements that more than compensated for the theoretical
overhead imposed by the abstraction.

I won't claim that a transactional ordered key-value store is the perfect
database abstraction for every situation, but it's one that I've found myself
missing a great deal since leaving Apple.

But I'm glad to hear that things are going well for you guys. Best of luck,
this is a brutal business!

~~~
jhugg
Hi Will. Thanks for the shout-out.

I still think many of the arguments in that blog post hold up for non-embedded
KV stores. I think you can mitigate a lot by aggressively caching metadata,
but eventually you end up moving the SQL engine closer and closer to the
storage layer to get performance. And yeah, you end up more monolithic and
testing gets harder. Sigh.

Some of this is workload dependent. If you're not touching many rows in your
queries and transactions, then you can get away with a lot more. But if you
give someone SQL, they're going to want to scan.

I wouldn't mind being proven wrong. Maybe Apple made FDB run SQL at legit
speeds. I haven't seen much from public projects that work this way to change
my mind yet.

> I won't claim that a transactional ordered key-value store is the perfect
> database abstraction for every situation, but it's one that I've found
> myself missing a great deal since leaving Apple.

How does Spanner not satisfy that itch? Is it that it's not ordered?

~~~
wwilson
> How does Spanner not satisfy that itch? Is it that it's not ordered?

I was probably unclear in my previous comment. Spanner is great! (And Spanner
is ordered). The particular aspect of FDB that I miss is what some of our old
customers called "the bottom half of a database" or "a database construction
kit". In fact FDB was an awesome modular building block for all kinds of
distributed systems, not just databases. We hacked up prototypes for a whole
bunch of these but sadly never got around to releasing them.

Spanner is a full-fledged enterprise grade database with opinions about your
data model, query language, types, etc. For the vast majority of customers,
that's much more useful than what FDB provided. But for me as somebody who
enjoys kicking around silly new ideas for distributed systems, it's a bit less
fun.
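
To make the "database construction kit" idea concrete, here's a minimal sketch
of a layer built on a transactional ordered key-value store (all interfaces
here are hypothetical -- this is not the real FDB API). Because the store gives
you serializable multi-key transactions, a secondary index is just a second
write in the same transaction:

    trait Tx {
      def set(key: String, value: String): Unit
      def get(key: String): Option[String]
      def getRange(prefix: String): Seq[(String, String)] // ordered prefix scan
    }
    trait KVStore {
      def transact[A](body: Tx => A): A // serializable transaction
    }

    // A tiny "layer": user records by id, plus a secondary index on email.
    class UserLayer(store: KVStore) {
      def put(id: String, email: String): Unit = store.transact { tx =>
        // Both keys commit atomically, so the index can never drift
        // from the primary record.
        tx.set(s"user/$id", email)
        tx.set(s"email/$email/$id", "")
      }

      def findByEmail(email: String): Seq[String] = store.transact { tx =>
        tx.getRange(s"email/$email/").map { case (key, _) => key.split('/').last }
      }
    }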

------
g0del_was_wr0ng
Including your 9x write amplification in the number of "consistent writes"
doesn't count -- like at all. I'm amazed nobody has called you out on this yet.

You're doing 3k batches per second with 4 logical writes each, right? So that
is at most 3-12k writes per second using the way that every other distributed
database benchmark and paper counts.
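
Back-of-the-envelope, in Scala (the per-row fan-out is my inference from the
headline number, not an official figure):

    val txPerSec        = 3330                 // user-issued transactions/sec
    val rowsPerTx       = 4                    // logical row writes per transaction
    val rowWritesPerSec = txPerSec * rowsPerTx // = 13,320 rows/sec
    val fanoutPerRow    = 9                    // inferred index/partition updates per row
    val headline        = rowWritesPerSec * fanoutPerRow // = 119,880, i.e. the ~120k claim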

Otherwise -- if you continue counting writes in this special, misleading way --
you'd have to multiply every other distributed db benchmark's performance
numbers by a factor of 3-15x to get an apples-to-apples comparison.

The 12k batched writes/sec through what I assume is a Paxos variant is still
pretty impressive though! Good to get more competition/alternatives for
ZooKeeper & friends!

~~~
evanweaver
No, that's not write amplification. Replication and storage engine fanout are
not included. Instead, that number is the number of logical partition updates
per row, per transaction. This makes the test comparable to tests of key-value
stores that can only update one key per transaction. The FoundationDB test
mentioned elsewhere here was reported the same way.

If you want to include write amplification, then multiply by 6x again to
account for the replicated log and the tables themselves.

~~~
g0del_was_wr0ng
It's doing 12k rows in 3k user-issued write operations (transactions) per second.

Counting any kind of "internal write effects" that result from a user write
(i.e. write amplification) is obviously done to make the benchmark misleading,
and it does not make it comparable to key-value stores.

12k writes/s is the number of rows that are written from a user perspective.
So 12k/s is also the number you have to use when comparing it to key value
stores. But of course, comparing Fauna with eventually consistent systems is
not a really fair comparison. You don't make it fairer by misleading in your
benchmark though.

Also, just because some other vendor posted a misleading benchmark on HN (I
don't know if they did), that doesn't make it right or mean you should do it.
Just call them out on it too.

~~~
evanweaver
Indexes aren't internal write effects, they are user-defined. But we will have
additional benchmarks later on that focus on row commits only.

We tried to replicate a realistic workload rather than just target the best
case or worst case performance profile.

~~~
evanweaver
For purposes of this benchmark, the first yes, the second no. But neither of
them are free.

~~~
g0del_was_wr0ng
Well, fair enough, but that method of counting is not what everyone else does
or assumes, so somebody just reading the title "120k writes per second" gets
the wrong impression of what's going on.

(An uninitiated reader would assume you're committing 120k rows per second from
the title, whereas it's "only" 12k rows and "only" 3k actual operations over
the wire. Still, 3-12k is pretty impressive.)

------
web007
This description is very misleading.

120,000 writes per second is accurate, talking about actual durable storage
(disk) writes. But it's only 3,330 transactions per second, which should be the
number that a user cares about.

I don't have proper data and I'm a bit rusty, but I feel like Cassandra could
blow that away if you set similar consistency requirements on the client side
(QUORUM on read, same for write?). Am I understanding this correctly, or does
Fauna/Calvin give you something functionally better than what C* can do?

~~~
freels
A more apples-to-apples comparison with Cassandra would be FaunaDB
transactions versus Cassandra's atomic batch mutations, or its Paxos-based
lightweight transactions, as opposed to the single-cell writes tested in most
Cassandra benchmarks.

YMMV, but we've found the performance of Cassandra writing out similar-sized
multi-row atomic batches at QUORUM to be similar in this hardware
configuration.

FaunaDB transactions are quite a bit more powerful, as they can span multiple
keys, use conditionals and read-modify-write logic, and still resolve with
serializable semantics.

~~~
web007
That makes a lot more sense then. It's still a misleading statement to say
"writes" vs "transactions" since you could (potentially) make fewer writes and
support more transactions. The ratio between the two is a measure of
efficiency, but only transactions matter to end-users.

~~~
evanweaver
You're right that one number trades off against the other. I'm not sure that
only transactions matter, though.

Tracking logical writes makes the test comparable to tests of key-value stores
that can only update one key at a time, which is pretty much every other
distributed database.

------
qaq
Maybe I am missing some special point, but a decent PG box will do 1,000,000+
TPS vs the 3,000+ TPS here. When pgXact lands, it will do close to 2,000,000
TPS. So, reading all the posts about an amazing new db "X" that can do about N
times less than PG on a multi-node cluster, I get confused about why the
numbers are being presented as some sort of achievement.

~~~
otterley
That doesn't sound right. Even the newest NVMe devices can't do 1M writes per
second; they're maxing out at around 330k IOPS.

The 1M "TPS" you're referring to is a read-only benchmark (e.g.
[http://akorotkov.github.io/blog/2016/05/09/scalability-
towar...](http://akorotkov.github.io/blog/2016/05/09/scalability-towards-
millions-tps/)). Those are reads (most likely from the buffer cache), not
writes or transactions in any real sense.

~~~
qaq
330K IOPS is for a single device, and you are very unlikely to be running a
single device. There are Fusion-io models that can do 1M IOPS, but they are on
the exotic side. If you are optimising for throughput, you can configure
commit_delay so that a single fsync covers multiple commits.
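
For example (these are real PostgreSQL settings, but the values here are
illustrative -- see the docs for tuning guidance):

    # postgresql.conf -- group commit: delay the WAL flush slightly so that
    # several concurrent commits can share a single fsync.
    commit_delay = 100      # microseconds to wait before flushing the WAL
    commit_siblings = 5     # only delay if at least this many other
                            # transactions are currently active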

~~~
chillydawg
Relaxing disk commits is a quick route to data loss. Might as well use mongo
at that point.

~~~
qaq
commit_delay does not relax anything; it just increases latency to group
multiple commits.

------
zenithm
This is a new one to me... the referenced paper is here:
[http://cs.yale.edu/homes/thomson/publications/calvin-sigmod1...](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf)

How does this algorithm compare to whatever Google Spanner does?

~~~
evanweaver
That's a good and complicated question. They are both fully ACID-compliant
systems. The biggest difference as a developer is that Calvin never blocks
reads, contended or not. You get causally consistent single-replica reads with
no coordination.

This makes the read performance equivalent to something like Cassandra at
CONSISTENCY.ONE, without giving up the cross-partition write linearization of
something like Spanner.
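
A toy model of the core Calvin idea (heavily simplified from the paper, and
not FaunaDB code): transactions are first sequenced into a replicated global
log, and every replica applies them deterministically, so any replica can
serve a consistent read at whatever log position it has reached, with no
cross-replica coordination:

    // Toy sketch of Calvin-style deterministic execution (not FaunaDB code).
    case class Txn(id: Long, writes: Map[String, Int])

    class Replica {
      private var state   = Map.empty[String, Int]
      private var applied = 0L // highest log position applied so far

      // Log entries arrive in the globally agreed order; every replica
      // that has applied position n has identical state.
      def applyTxn(pos: Long, txn: Txn): Unit = {
        require(pos == applied + 1, "entries must be applied in log order")
        state = state ++ txn.writes
        applied = pos
      }

      // Reads never block on other replicas: whatever position this
      // replica has reached is a consistent snapshot.
      def read(key: String): Option[Int] = state.get(key)
    }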

~~~
imglorp
Can Calvin scale beyond the OP claim?

I've personally seen a Cassandra ring go to more than 2M ops/sec.

~~~
jchrisa
Yes. This isn't near the top end; it's more like what happens when we run
benchmarks on a reasonable web-infrastructure-class cluster. This is still
only 5 machines in each datacenter.

------
imownbey
"Calvin's primary trade-off is that it doesn't support session transactions,
so it's not well suited for SQL. Instead, transactions must be submitted
atomically. Session transactions in SQL were designed for analytics,
specifically human beings sitting at a workstation. They are pure overhead in
a high-throughput operational context."

Is this specifically for distributed SQL only? I think there are some scalable
SQL systems that don't support sessions either.

~~~
jchrisa
Calvin is a generalized consistency protocol that we use in FaunaDB to
support relational semantics (but not SQL) in our database.

Multi-query transactions can be useful, but the FaunaDB query language is
functional rather than declarative like SQL, so composing queries that can do
everything you want is usually easier than in SQL.

~~~
cakoose
How would you perform the classic "transfer money from one account to another"
operation?

Would you create a single operation that reads one record, checks that it's
enough, then adds the amount to another record?

Or maybe you'd first read both accounts, then issue a conditional write
operation that makes sure the data hasn't changed before doing the write?

~~~
freels
FaunaDB's query language makes it straightforward to do it the first way. All
queries are serializable, so any preconditions checked would gate transaction
commit as you would expect, and read-modify-write style transactions work.

edit: here's an example in our Scala DSL:

    
    
      // Transfer `amount` from accountA to accountB in one atomic transaction.
      Let {
        val amount = 50
        val balanceA = Select("data" / "balance", Get(Ref("accountA")))
        val balanceB = Select("data" / "balance", Get(Ref("accountB")))
        If(Gteq(balanceA, amount),
          Do(
            // Both updates commit atomically, or not at all.
            Update(Ref("accountA"), Obj("data" -> Obj("balance" -> Subtract(balanceA, amount)))),
            Update(Ref("accountB"), Obj("data" -> Obj("balance" -> Add(balanceB, amount)))),
            "Transfer Success"
          ),
          "Insufficient Funds"
        )
      }
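
To run it, you hand the whole expression to the driver in one query; roughly
like this (driver names from memory, so treat it as a sketch rather than the
exact API):

    import faunadb.FaunaClient

    // Sketch only: the whole Let expression above is submitted as a single
    // query and commits (or aborts) as one transaction.
    val client = FaunaClient(secret = "<your server key>")
    val result = client.query(transfer) // `transfer` = the Let { ... } above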

~~~
cakoose
Is the wire format roughly isomorphic to the structure above? Or does the
Scala library convert this code-like structure into something simpler/flatter?

(BTW, would be nice if I could read your API docs without signing up for an
account.)

~~~
evanweaver
It is isomorphic; right now it's layered onto JSON, but eventually we will
support CBOR on the wire as well. Internally everything is CBOR with LZ4 block
compression.
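
For a rough sense of what that means, the If/Do skeleton of the transfer
example above would come out as a JSON tree of the same shape -- something
like this (a hand-written illustration, not the exact wire encoding):

    { "if"   : { "gte": [ { "var": "balanceA" }, { "var": "amount" } ] },
      "then" : { "do": [ ... ] },
      "else" : "Insufficient Funds" }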

The docs will eventually be available without an account.

------
olegkikin
2011: Benchmarking Cassandra Scalability on AWS - Over a million writes per
second

[http://techblog.netflix.com/2011/11/benchmarking-cassandra-s...](http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html)

Also a single SSD from 2015 is rated at 120K writes per second:

PM1725:
[http://www.samsung.com/semiconductor/global/file/insight/201...](http://www.samsung.com/semiconductor/global/file/insight/2015/11/pm1725-ProdOverview-2015-0.pdf)

------
lngnmn
Consistent writes to permanent storage, or it didn't happen.

~~~
evanweaver
They are durable; will clarify.

~~~
lngnmn
Otherwise it would be like Mongo did -- "we have put your data in the OS
buffers; what could possibly go wrong?"

~~~
evanweaver
Haha no...it's the opposite of that.

------
rystsov
Is it possible to download fauna to play with it on my own?

~~~
jchrisa
You can sign up for the cloud version in a few seconds here:
[https://fauna.com/serverless-cloud-sign-up](https://fauna.com/serverless-cloud-sign-up)

~~~
rystsov
With the cloud version, it's impossible to run Jepsen-like tests to validate
consistency and to observe the cluster's behavior when the network is unstable
and nodes crash.

~~~
jchrisa
We have some correctness results coming that should address your concerns.
Watch this space for updates.

~~~
rystsov
Will you release binaries, or is it gonna be a pure cloud solution like
DocumentDB or DynamoDB?

~~~
evanweaver
We already have enterprise customers in production on-premises; there will
also be a downloadable developer edition. There are no service or library
dependencies other than the JVM.

