
TAPIR: A new high-performance, transactional key-value store - iyzhang
https://github.com/UWSysLab/tapir
======
matthewaveryusa
Great to see fresh ideas! One thing I don't like in the presentation is that
TAPIR is presented as doing better than what is out there without stating the
conditions. First off, there's quite a bit of hand-waving when it comes to
leadership bottlenecks -- please assume that sane sharding is occurring. I'm
not entirely sure transactions spanning partitions are unique to, say, Paxos.
The abort rate vs. contention numbers seem off: looking at the paper, all the
nodes are in a single datacenter. I would love to see these numbers with
TAPIR spread across larger geographical regions. My suspicion is that higher
latency will hurt the abort rate more than it would with a strong leader.
What about poor clock synchronization conditions? What about testing with a
variety of client latencies? Since the client is effectively acting as a
leader in TAPIR, the client is, in some ways, contending with other clients,
and the abort rate may be correlated with client latency. I don't think
high-latency clients would observe the same correlation with a strong leader.
I wish more of the compromises were presented.

~~~
iyzhang
The leader bottleneck will continue to exist even with great sharding,
because the leaders simply process more messages than the replicas, so TAPIR
will allow each shard to support more throughput.

The paper has an evaluation for multi-data-center replication in Figure 12. We
assume that the clients are web servers, so they are always close to one of
the replicas, but not all of them. The result we found is basically that TAPIR
performs better in the multi-data center case except when the leader is in the
same data center as the client. So it depends on whether you can always
guarantee that the leader is in the same data center as the client.

The abort rate continues to essentially track the latency needed for commit.
So, TAPIR reduces the abort rate compared to OCC because it reduces the commit
latency. At very high contention, locking is likely to make slightly more
progress, but no systems with strong consistency will be able to provide high
performance. If you are interested in some other ways to optimize for the high
contention case, take a look at our work on Claret:
[http://homes.cs.washington.edu/~bholt/projects/claret.html](http://homes.cs.washington.edu/~bholt/projects/claret.html)

We also tested with high clock skew. The paper notes, "with a clock skew of 50
ms, we saw less than 1% TAPIR retries." Since the clients can use the retry
timestamps to sync their clocks, it only adds an extra round-trip, so it still
leaves TAPIR with the same latency as a conventional system, even in cases of
extremely high clock skew.
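The retry path can be sketched roughly like this (a minimal sketch with
hypothetical names, not TAPIR's actual code): when a client's proposed commit
timestamp is rejected because its clock is behind, the server's reply
suggests a timestamp that would succeed, and the client retries with it --
one extra round-trip, which it can also use to nudge its own clock.

```python
# Sketch (hypothetical API, not TAPIR's code): a server rejects a stale
# proposed timestamp but returns one that would be accepted; the client
# retries once with the suggested value.

class Server:
    def __init__(self):
        self.latest_ts = 0  # highest timestamp accepted so far

    def prepare(self, proposed_ts):
        if proposed_ts <= self.latest_ts:
            # Reject, but tell the client a timestamp that would succeed.
            return ("RETRY", self.latest_ts + 1)
        self.latest_ts = proposed_ts
        return ("OK", proposed_ts)

def commit(server, local_clock):
    status, ts = server.prepare(local_clock)
    if status == "RETRY":
        status, ts = server.prepare(ts)  # one extra round-trip
    return status, ts

server = Server()
server.latest_ts = 100       # server is "ahead" of this client's clock
print(commit(server, 42))    # succeeds on the retry with ts 101
```

Even with large skew, the cost is bounded at one additional round-trip per
retried transaction, which is the property the quoted result relies on.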

------
jph
The key insight: "... existing transactional storage systems waste work and
performance by incorporating a distributed transaction protocol and a
replication protocol that both enforce strong consistency. Instead, we show
that it is possible to provide distributed transactions with better
performance and the same transaction and consistency model using replication
with no consistency."

The full paper is here:
[http://delivery.acm.org/10.1145/2820000/2815404/p263-zhang.p...](http://delivery.acm.org/10.1145/2820000/2815404/p263-zhang.pdf)

~~~
kluck
Interesting. Can you elaborate on this? I am not sure what "inconsistent
replication" means.

~~~
ComNik
It means that operations don't get executed in the same order on every
replica. If the system's state depends on the ordering of operations, then each
replica might be in a different state for the same set of operations.

"Consistent replication" would be using a protocol like Paxos to have the
replicas decide on a single order of operations.
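A toy illustration of the order-dependence (hypothetical code, just to show
the idea, not anything from TAPIR):

```python
# Two replicas apply the same *set* of operations in different orders.
# For order-dependent operations (last-writer-wins puts), they diverge.

def apply(ops, state=None):
    state = {} if state is None else state
    for key, value in ops:
        state[key] = value  # last writer wins
    return state

ops = [("x", 1), ("x", 2)]

replica_a = apply(ops)            # applies ("x", 1) then ("x", 2)
replica_b = apply(reversed(ops))  # applies ("x", 2) then ("x", 1)

print(replica_a)  # {'x': 2}
print(replica_b)  # {'x': 1} -- same op set, different state
```

Consistent replication (Paxos-style) forces both replicas down one of these
two orders; inconsistent replication allows the divergence and pushes the
ordering problem up to the transaction protocol.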

~~~
TallGuyShort
Huh... isn't that required for being "transactional"? I should probably read
the paper...

~~~
iyzhang
No, that was our key insight. Basically, you can think of it this way: using
something like Paxos requires all replicas to run the same code, so every
replica checks for transaction conflicts. However, you actually only need one
replica to detect a conflict, so there is no reason for every replica to run
the same checks -- this is wasted work. Instead, we just make sure that each
conflict is checked by at least _one replica_.

The other interesting conclusion is that there are other workloads like this.
For example, it is possible to build a reliable and better performing lock
server in this way as well (and there is an implementation in the github
repo). So, you'd get something similar to Chubby, but where the latency to the
server is only a single round-trip in the common case.
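A rough sketch of the quorum-intersection argument, with made-up names and
none of TAPIR's actual machinery: each replica runs its conflict check
against only the transactions it has personally seen, and a transaction
commits if a majority quorum votes OK. Because any two majorities intersect,
two conflicting transactions always meet at at least one replica, which
votes to abort the later one.

```python
# Sketch (hypothetical, not TAPIR's real protocol): per-replica conflict
# checks plus majority quorums are enough to catch every conflict, even
# though no single replica sees every transaction.

import random

class Replica:
    def __init__(self):
        self.prepared = []  # write-sets of transactions this replica saw

    def check(self, write_set):
        # Local OCC-style check: conflict if write-sets overlap.
        if any(write_set & ws for ws in self.prepared):
            return "ABORT"
        self.prepared.append(write_set)
        return "OK"

def try_commit(replicas, write_set, quorum):
    # Ask a random majority; commit only on unanimous OK votes.
    votes = [r.check(write_set) for r in random.sample(replicas, quorum)]
    return all(v == "OK" for v in votes)

replicas = [Replica() for _ in range(3)]
quorum = 2  # majority of 3

ok1 = try_commit(replicas, {"x"}, quorum)  # first txn on key x: commits
ok2 = try_commit(replicas, {"x"}, quorum)  # conflicting txn: the quorums
                                           # overlap, so someone votes ABORT
print(ok1, ok2)  # True False
```

Whichever two majorities are sampled, they share at least one replica, so
the second transaction on "x" is always rejected -- the conflict check is
distributed, not replicated.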

------
proksoup
I found this video a couple links in, the speaker I think is Irene Zhang,
listed for many of the commits on that repo:
[https://www.youtube.com/watch?v=yE3eMxYJDiE](https://www.youtube.com/watch?v=yE3eMxYJDiE)

~~~
mhw
And also the poster of this item, presumably.

------
nickpsecurity
"enabling TAPIR to provide the same transaction model and consistency
guarantees as existing systems, like Spanner, with better latency and
throughput."

Oh, hell yeah! Now that's great stuff. Can't wait to see the next step done by
this or another team: building an alternative to the F1 RDBMS that Google
built on Spanner. It would give CockroachDB some competition.

~~~
msackman
I'm the author of GoshawkDB.

I've just been watching the talk on this at
[https://www.youtube.com/watch?v=yE3eMxYJDiE](https://www.youtube.com/watch?v=yE3eMxYJDiE).
GoshawkDB has a very similar messaging and replication design.
In fact, in some places, it appears that GoshawkDB's design is a little
simpler.

There are obviously many differences too: for example GoshawkDB runs
transactions on the client, GoshawkDB uses Paxos Synod instead of 2PC, and
GoshawkDB clients only connect to one server so there are 2 extra hops, but
that's a constant so from a scaling pov, it should behave the same.

One of the biggest differences is GoshawkDB uses Vector Clocks (that can grow
and shrink) rather than loosely synchronized clocks.

This TAPIR work does look great - I had no idea that it was ongoing. I'll read
through the paper too, but it's great that GoshawkDB has so many design ideas
in common.

~~~
nickpsecurity
Very interesting. I went to search for your stuff only to find I had this
bookmark:

[https://news.ycombinator.com/item?id=10778946](https://news.ycombinator.com/item?id=10778946)

So, I knew it was interesting and possibly great work but didn't have any time
to look at it. I'll move it up the backlog a bit. Maybe at least give the
papers and stuff a good skimming tonight. :)

Note: Parallel, ongoing, similar work going unnoticed is a huge problem in
both IT and INFOSEC. I have a meme here about how we constantly reinvent the
same solutions, make the same mistakes, or move at a slower pace due to lack
of connections with similar work. I keep proposing something that's like a
free and open combo of ACM or IEEE with members who are academics, FOSS
coders, pro's... people that _contribute_ at least in writing. Stash papers,
code, and forums there. So, odds of serendipity increase. Thoughts?

~~~
msackman
Heh, definitely! I could ramble for hours about how there are so many
disincentives to share anything until launch - the whole industry really isn't
set up to actually make progress. However, multiple independent co-invention
of stuff is still valuable from a validation point of view. The worst bits I
see are when something from 20 years ago gets reinvented. I have emailed Irene
just to see if there's any way to collaborate.

~~~
nickpsecurity
Great! It's awesome to see you taking the initiative! Might be an exception in
the making.

I'm about to leave for work, but a quick question. What I want to see is an
open equivalent of Google's F1 RDBMS: the best one. Does yours already provide
its attributes, is it more like Spanner instead, or what? Aside from
CockroachDB, where is the OSS on an F1 competitor?

~~~
msackman
So GoshawkDB doesn't have an SQL engine currently, so in that way it's
probably not comparable with F1. GoshawkDB stores and accesses an object
graph. Hopefully the howtos on the website will help people get into the
mindset.

I'm not sure if it's worth trying to compare anything to Spanner or F1 because
unless I'm mistaken, no one outside of Google can use F1 or Spanner - they're
not released. So who knows what the performance of these things actually is?
There's no way to verify the claims made in any Google paper.

~~~
nickpsecurity
"So GoshawkDB doesn't have an SQL engine currently, so in that way it's
probably not comparable with F1. GoshawkDB stores and accesses an object
graph. Hopefully the howtos on the website will help people get into the
mindset."

Gotcha.

"I'm not sure if it's worth trying to compare anything to Spanner or F1
because unless I'm mistaken, no one outside of Google can use F1 or Spanner -
they're not released. So who knows what the performance of these things
actually is? There's no way to verify the claims made in any Google paper."

I think it's worthwhile for these reasons:

1. They describe it in enough detail in their papers for comparisons or maybe
clones to be made.

2. That's led to competitors or open-source clones of tech before. Remember
map reduce?

3. They've deployed it operationally.

4. Since when can Google not handle a backend tech it claims to be able to
do? I thought their rep was solid on that stuff.

So, Google already has a strongly-consistent, scalable, multi-datacenter
RDBMS with NoSQL-like performance. If it's good on enough workloads, that's
the best thing I've heard in that category since NonStop SQL. The GPS thing,
while a brilliant hack, might be hard for adoption -- an improvement area the
CockroachDB people are already targeting. A full clone or competitor could
rock the DB world as a great alternative to Oracle clusters or NonStop for
mission-critical workloads where some delays are tolerable.

------
ngbronson
Thank you for publishing the code for your paper. It's an excellent
contribution by itself, and should help people understand and evaluate the
ideas. Are you going to publish the TLA+ specification as well?

------
gregwebs
Reminds me of Hyperdex which has consistent replication through value
dependent chaining but only enforces ordering of commits when needed.

------
imaginenore
You claim it's high-performance, and you provide no benchmarks.

How does it compare with Redis, Aerospike, Tarantool, Couchbase/MemBase,
Memcached, VoltDB, LevelDB, Kyoto Cabinet, Riak, Cassandra, RocksDB, LMDB,
Neo4j, HBase, ArangoDB, Voldemort, FoundationDB?

~~~
iyzhang
Figure 14 in the paper gives comparisons with MongoDB, Redis and Cassandra
using YCSB. The git repo has the YCSB bindings for TAPIR, so you can run TAPIR
against any of the other systems that also have YCSB bindings.

~~~
imaginenore
Thanks. If it really outperforms Cassandra, it's pretty impressive.

~~~
brazzledazzle
I don't think there's any claims that it did. The paper and the talk both made
it clear that Cassandra outperformed it. I think MongoDB is the only one it
beat. But I think the point is that it's able to perform well while being
consistent, not that it's faster than "weakly consistent" storage systems like
Cassandra.

