
Cross shard transactions at 10M requests per second - makmanalp
https://blogs.dropbox.com/tech/2018/11/cross-shard-transactions-at-10-million-requests-per-second/
======
elvinyung
Ok, while we're at this, I think that the most important part of the post was
this:

> Edgestore data was already well-collocated, which meant that cross-shard
> transactions ended up being fairly rare in practice—only 5-10% of Edgestore
> transactions involve multiple shards.

What if performant cross-shard transactions are a red herring, and the thing
that we should be looking more into is reliable automatic data colocation to
avoid performing cross-shard transactions as much as possible? There's a decent
amount of academic research around this, with projects like SWORD [1] and
Schism [2] that study shard load balancing as a problem of hypergraph
partitioning. It seems like this might be worth incorporating into commercial
distributed database projects.

[1]
[http://www.cs.umd.edu/~amol/papers/edbt13.pdf](http://www.cs.umd.edu/~amol/papers/edbt13.pdf)

[2]
[http://www.vldb.org/pvldb/vldb2010/papers/R04.pdf](http://www.vldb.org/pvldb/vldb2010/papers/R04.pdf)
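
To make the colocation goal concrete, here is a minimal sketch of the quantity a workload-aware partitioner would try to drive down: the fraction of transactions that touch more than one shard under a given placement. The function name, the workload, and the placements are all made up for illustration; this is not code from SWORD or Schism.

```python
def cross_shard_fraction(transactions, placement):
    """Fraction of transactions whose keys live on more than one shard.

    transactions: list of sets of keys touched together (hypothetical workload)
    placement:    dict mapping each key to a shard id (hypothetical layout)
    """
    if not transactions:
        return 0.0
    crossing = sum(
        1 for keys in transactions
        if len({placement[k] for k in keys}) > 1
    )
    return crossing / len(transactions)


workload = [{"a", "b"}, {"c", "d"}, {"a", "c"}, {"b"}]
colocated = {"a": 0, "b": 0, "c": 1, "d": 1}  # keys used together share a shard
scattered = {"a": 0, "b": 1, "c": 0, "d": 1}  # same keys, less lucky layout
```

With this toy workload, the colocated layout leaves one of four transactions crossing shards, while the scattered layout leaves two of four.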

~~~
tahara
Thanks for the references -- will take a look.

Edgestore's API is set up to shepherd users into good collocation patterns by
default, and a lot of work over the past year or two went into improving
collocation and educating users about best practices. The collocation efforts
were actually orthogonal to implementing cross-shard transactions, but they
were obviously very beneficial.

~~~
elvinyung
Thanks -- just curious, am I correct to interpret this to mean that, thus
far, the performance of the system relies on users explicitly defining good
colocation within their application-level data model?

For some reason this reminds me of something like the entity group concept in
Google's Megastore [1].

[1]
[http://cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf](http://cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf)

~~~
tahara
There is a point somewhere between where we are today and a completely
uncollocated free-for-all at which the system would fall over. A separate axis
would come into play sooner: the rate at which users request transactions
(with locks) versus non-transactional writes using optimistic concurrency
control. Our guidance is therefore that users reserve transactions for cases
where there are correctness-critical reasons why two objects need to be
updated together, and rely on asynchronous primitives to handle "eventually
consistent" mutations.
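
For readers unfamiliar with the distinction, a non-transactional write under optimistic concurrency control can be sketched as a versioned compare-and-set with retry. This is a generic illustration, not Edgestore's API; `VersionedStore`, `write_if_version`, and `update` are hypothetical names.

```python
class VersionedStore:
    """Toy single-process store; every key carries a version counter."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # Missing keys read as version 0 with no value.
        return self._rows.get(key, (0, None))

    def write_if_version(self, key, expected_version, value):
        """Apply the write only if nobody else wrote since we read."""
        version, _ = self.read(key)
        if version != expected_version:
            return False  # conflict: caller should re-read and retry
        self._rows[key] = (version + 1, value)
        return True


def update(store, key, fn, max_attempts=5):
    """Read-modify-write loop with no locks held between read and write."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        if store.write_if_version(key, version, fn(value)):
            return True
    return False
```

No lock is taken here; a writer that loses the race simply retries, which is why this style tends to be cheaper than lock-based transactions when conflicts are rare.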

------
amenghra
I have witnessed firsthand companies that decide to do cross-shard
transactions when in fact they don’t really need them.

For example, imagine that you have a distributed key-value store. You want two
users (on different shards) to either both be able to see a piece of content
or neither. You can achieve this by allocating a key on either shard (or a
third shard) and writing a reference to it on each user’s shard. If all of
those writes succeed, you can write to the new key which was allocated. If any
write fails, you can bail and your data is consistent. Wrap the above logic in
a nice library and your service will scale horizontally.
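
A minimal sketch of the scheme above, assuming a toy in-memory `Shard` class and made-up key formats (`ref:...`, `content:...`). The reference writes come first, and the final single-shard write to the allocated key acts as the commit point; readers ignore references whose target key does not exist.

```python
import uuid


class Shard:
    """Toy in-memory shard; a stand-in for one node of a key-value store."""

    def __init__(self, fail_writes=False):
        self.data = {}
        self.fail_writes = fail_writes  # simulate an unavailable shard

    def put(self, key, value):
        if self.fail_writes:
            raise IOError("write failed")
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


def share_content(content, content_shard, user_shards):
    """Write references first, then the content key as the commit point."""
    key = "content:" + uuid.uuid4().hex  # allocated, but not yet written
    try:
        for user, shard in user_shards.items():
            shard.put("ref:%s:%s" % (user, key), key)
    except IOError:
        # Bail: any refs already written point at a key that does not exist.
        return None
    content_shard.put(key, content)  # single-shard write makes it visible
    return key


def visible_content(user, key, content_shard, user_shards):
    """A reference only counts if the content key it points to exists."""
    if user_shards[user].get("ref:%s:%s" % (user, key)) is None:
        return None
    return content_shard.get(key)
```

A failed attempt leaves dangling references behind, so a real library would also need some form of garbage collection for them.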

~~~
tahara
For us (Dropbox), the threshold ended up being multiple product teams having
to implement ad-hoc two-phase commit over their datatypes and burning
engineering hours not on implementing it but on proving that they had gotten
it right and on handling the cleanup after any unsuccessful writes.

You're right that most systems probably don't need 2PC, which is why Edgestore
didn't include it until now. As mentioned in the post, we finally felt that we
had reached the right balance of tradeoffs to justify the API-level primitive.

~~~
jamesblonde
When you get to big enough scale, you need cross shard transactions. Google
reached it with Spanner.

If I look at Dropbox, I am sure features related to sharing folders between
people/organizations cross shards. You can't avoid them if you want to offer a
fully featured product.

After a quick skim of 2PC in Edgestore (sorry, no time), it is unclear if
there is a single transaction coordinator (TC) or not. I assume it is a single
TC - that you can scale it to 10M transactions/sec is impressive. The really
hard part is having multiple TCs and designing protocols to coordinate their
recovery after failure. Here is a good example in the open-source NDB (MySQL
Cluster) system -
[https://drive.google.com/file/d/1gAYQPrWCTEhgxP8dQ8XLwMrwZPc...](https://drive.google.com/file/d/1gAYQPrWCTEhgxP8dQ8XLwMrwZPcXHW_1/view)

~~~
tahara
We have a routing tier of a few hundred machines. Any of those can serve as
transaction coordinator. This is one of the places where having a transaction
record is useful -- in general, our locking and non-transactional read scheme
will wait nicely for a 2PC transaction to complete, but in the face of
failures, any actor in the system can abort the 2PC transaction by marking the
transaction record. There's also a really nice optimization for non-
transactional reads that allows them to read even in the event of a staged but
not yet committed 2PC transaction (you can prove to yourself that a linearized
read can occur on either side, if the 2PC transaction is still pending when
the read begins).
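
A rough sketch of how a transaction record could play this role, under heavy simplification (single process, no real locking or networking). The names `TransactionRecord`, `try_finish`, and `Shard.stage` are hypothetical and are not Edgestore's actual interfaces.

```python
PENDING, COMMITTED, ABORTED = "pending", "committed", "aborted"


class TransactionRecord:
    """Single source of truth for one transaction's outcome."""

    def __init__(self):
        self.state = PENDING

    def try_finish(self, outcome):
        """Compare-and-set: only a pending transaction can be decided.

        Any actor may call this, e.g. a coordinator committing or a
        waiter aborting after a suspected coordinator failure.
        """
        if self.state == PENDING:
            self.state = outcome
        return self.state


class Shard:
    def __init__(self):
        self.committed = {}  # key -> value
        self.staged = {}     # key -> (record, new_value)

    def stage(self, key, value, record):
        self.staged[key] = (record, value)

    def read(self, key):
        """Non-transactional read that does not wait on a staged write."""
        if key in self.staged:
            record, value = self.staged[key]
            if record.state == COMMITTED:
                return value
            # Pending or aborted: the old value is still a valid answer.
        return self.committed.get(key)
```

Because the outcome lives in one record, whoever flips it first wins, and a reader that finds the record still pending can return the old value without waiting.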

------
peter_d_sherman
The idea of creating an abstraction layer over thousands of MySQL nodes is a
novel one, and should be commended. I like the idea a lot, and by Dropbox's
scale and success, it must work well.

I would be interested in a granular point-by-point comparison between this
layer and native MySQL replication. In other words, if someone were to rewrite
MySQL replication such that it did everything that this abstraction layer did
(in addition to replication), what would it need to do?

Now (and this is strictly academic/theoretical), I'm curious what would need
to change in the abstraction layer if it were to support a whole bunch of
disparate SQL databases underneath it, i.e., Postgres, SQLite, SQL Server,
Oracle, etc. (No, that wouldn't be practical, but it would be an interesting
exercise to really learn where the gotchas might be when working with
different SQL dialects...)

~~~
elvinyung
Vitess [1] is a similar kind of abstraction layer that's compatible with the
native MySQL wire protocol. It also supports atomic cross-shard transactions
using 2PC.

[1] [https://vitess.io/](https://vitess.io/)

~~~
dalyons
Does anyone know of a Vitess equivalent for Postgres? I’d love to use it.

~~~
elvinyung
Citus seems to be doing pretty well.

------
bufferoverflow
Dropbox is very easy to shard, though: the users are mostly independent. Take
something like Facebook, and it becomes much, much harder - anybody can friend
anybody, anybody can message anybody, anybody can like or comment on public
posts.

~~~
tahara
That's what we thought, too. (Un)fortunately(?) that's not the case anymore
given our product offerings :)

