
Sharding data models - pedrobelo
https://www.citusdata.com/blog/2017/08/28/five-data-models-for-sharding/
======
contingencies
For financial transaction services I can recommend sharding first by customer,
then by ledger. Instead of enforcing double-entry bookkeeping standards within
a single database, enforce only the guarantees you need at an
application-specific middleware server layer.
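To make that concrete, here's a minimal sketch (all names hypothetical, not from any real system) of the kind of invariant check such a middleware layer might run before writing entries out to per-customer ledger shards:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    ledger: str   # which ledger (shard) this entry belongs to
    account: str
    amount: int   # cents; positive = debit, negative = credit

def validate_double_entry(entries):
    """Enforce the core double-entry invariant at the middleware
    layer rather than in the database: every transaction's entries
    must sum to zero."""
    if sum(e.amount for e in entries) != 0:
        raise ValueError("unbalanced transaction")
    return True
```

The point is that the invariant lives in one place above the shards, so each shard can remain a plain, independently maintained database.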

As with all decisions, there are tradeoffs. For a little more up-front
complexity and a tiny nominal performance hit, this allows maintenance,
encryption, non-uniform storage paradigms to suit individual ledgers, rolling
upgrades, storage location migration, and other cool stuff which is typically
painful with traditional all-or-nothing RDBMS architectures.

~~~
adrianratnapala
Presumably you want a fairly strict consistency guarantee for whatever double-
entry invariants you do have. Does that cause a lot of complexity when you
kick it up to the middleware layer?

~~~
contingencies
It depends what you mean by significant. In the worst case, an SQL update
becomes a two-phase commit, with the latency of two network connection setups,
executes, and teardowns. External auditing must also be performed. Sounds like
hell.
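The two-phase commit mentioned above can be sketched roughly like this (toy in-memory shards, coordinator logic illustrative only, not any particular implementation):

```python
class Shard:
    """Toy in-memory shard supporting prepare/commit/rollback."""
    def __init__(self):
        self.data = {}
        self.staged = None

    def prepare(self, updates):
        # Validate and stage the updates; returning False votes "abort".
        self.staged = dict(updates)
        return True

    def commit(self):
        self.data.update(self.staged)
        self.staged = None

    def rollback(self):
        self.staged = None

def two_phase_commit(shards_updates):
    """Phase 1: every shard must vote yes. Phase 2: commit everywhere
    (or roll back everything if any shard refused)."""
    prepared = []
    for shard, updates in shards_updates:
        if not shard.prepare(updates):
            for p in prepared:
                p.rollback()
            return False
        prepared.append(shard)
    for shard in prepared:
        shard.commit()
    return True
```

Each phase is a network round trip to every participating shard, which is where the extra per-transaction latency comes from.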

A flip-side view is that, in many cases, if you really want to trust your data
(not your database), then you want to be doing this stuff to a large extent
anyway... which means that, basically, it's just enforcing good practices that
should have already been present.

Complexity and per-TX latency increases, but for that you get maintainability
and a great deal of flexibility. Nothing is free... take your pick!

------
njay
Citus is a pretty powerful tool. We recently used Citus to help scale one of
our databases and wrote a blog post about it:
[https://hipmunk.github.io/posts/2017/Aug/16/a-fare-cache-in-a-sharded-data-cluster/](https://hipmunk.github.io/posts/2017/Aug/16/a-fare-cache-in-a-sharded-data-cluster/)

~~~
bpicolo
What were the before/after performance and load numbers? Your post mentions
that data-refresh frequency was an issue, but doesn't mention what sort of
problems it caused. Would love to hear how the move affected query perf.

------
xchaotic
Then there's the 0 model for sharding - don't shard, just replicate
everything, everywhere, with eventual consistency and MVCC

~~~
imtringued
Sharding and master-to-master replication are two different things.

The tradeoff of master-to-master replication is that it limits you to CRDTs,
append-only logs, and manual merging of conflicts by the end user.
Alternatively, if data loss is an acceptable tradeoff, you can use a
last-write-wins strategy.
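The CRDT option is the interesting one: a grow-only counter, for instance, merges by element-wise max, which is commutative, associative, and idempotent, so replicas converge no matter what order merges happen in. A minimal sketch:

```python
def merge_gcounter(a, b):
    """Merge two grow-only counters (replica_id -> count).
    Element-wise max is commutative, associative, and idempotent,
    so any two replicas converge regardless of merge order."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(counter):
    """The counter's value is the sum over all replicas' counts."""
    return sum(counter.values())
```

Last-write-wins, by contrast, just keeps the entry with the newest timestamp, silently discarding the other write.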

Sharding basically means having one isolated "database" per X (user, location,
etc.), but the tradeoff is that you can't have transactions across two of
those databases.
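The "one database per X" routing usually boils down to a stable hash of the shard key, which is also how the entity-ID hash-sharding mentioned further down the thread works. A sketch (function name hypothetical):

```python
import hashlib

def shard_for(tenant_id, num_shards):
    """Deterministically map a shard key to one of N shards.
    Uses a stable digest rather than Python's per-process
    randomized hash(), so routing survives restarts."""
    digest = hashlib.sha256(tenant_id.encode()).hexdigest()
    return int(digest, 16) % num_shards
```

Every query for a given tenant then goes to `shard_for(tenant, N)`; queries spanning two tenants land on two databases and lose transactional guarantees.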

Document databases usually do both. Each document is its own tiny database
with atomic updates, which can then be distributed over the cluster, and they
support multi-master replication for availability/automatic failover.

~~~
adrianmonk
Or you can do multi-master replication where you write to all replicas
synchronously, and deal with replica downtime by continuing as long as you
manage to write to a majority of them.
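That majority-write rule can be sketched like this (toy local replicas standing in for network calls, names hypothetical):

```python
class Replica:
    """Toy replica; a real one would sit across the network."""
    def __init__(self, up=True):
        self.up = up
        self.store = {}

    def write(self, key, val):
        if not self.up:
            raise ConnectionError("replica down")
        self.store[key] = val

def quorum_write(replicas, key, val):
    """Write to every replica; succeed iff a majority acknowledges.
    A minority of down replicas is tolerated."""
    acks = 0
    for r in replicas:
        try:
            r.write(key, val)
            acks += 1
        except ConnectionError:
            pass  # replica unavailable; keep going
    return acks > len(replicas) // 2
```

Reads then need to consult a majority too (or track versions) so that any read quorum overlaps the last successful write quorum.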

------
sdrothrock
What ORMs out there natively support sharding well? I use Django the most and
whenever I look into Django sharding, I don't see very many options being kept
up to date...
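For Django specifically, the usual hook is a database router: `db_for_read`/`db_for_write` are the method names Django actually calls, though the shard-selection logic below is purely illustrative (the `tenant_id` attribute and alias scheme are assumptions, not any particular library's API):

```python
NUM_SHARDS = 4

class TenantShardRouter:
    """A Django database router is a plain class; Django invokes
    db_for_read/db_for_write to pick a database alias per query."""

    def _shard_alias(self, hints):
        instance = hints.get("instance")
        if instance is None or not hasattr(instance, "tenant_id"):
            return None  # fall through to Django's default routing
        return "shard_%d" % (instance.tenant_id % NUM_SHARDS)

    def db_for_read(self, model, **hints):
        return self._shard_alias(hints)

    def db_for_write(self, model, **hints):
        return self._shard_alias(hints)
```

The aliases (`shard_0` … `shard_3` here) would map to entries in Django's `DATABASES` setting, and the router is registered via `DATABASE_ROUTERS`.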

~~~
craigkerstiens
I'm not sure any do completely out of the box, but many are very close.

We've created a library at Citus to make Rails sharding turn-key:
activerecord-multi-tenant. We have a Django one in the works; if you want to
take a look and give us any feedback on it, please drop us a note. From what
I've seen, Hibernate has some out-of-the-box support, but we only have a few
customers leveraging it, so I'm not as familiar with it.

~~~
sdrothrock
Unfortunately, I don't think I have the time to test anything rigorously, but
I would definitely be interested in learning more about the Django solution
when it's more mature! Is there a mailing list or something I could sign up
for?

------
Beanis
What distinction is being made between sharding by entity and sharding a
graph? The approach seems to be the same, just with different naming.

~~~
elvinyung
Yeah, that section doesn't seem to make much sense. I don't think there
are actually any real examples of "graph sharding" in the wild. The graph
databases that are available, like Neo4j, don't usually natively provide
horizontal partitioning. (Of course -- the problem of finding the minimum
k-cut of a graph is itself NP-complete. Doing this incrementally with a
dynamic graph is even harder.)

The one solution that was mentioned, Facebook's TAO, isn't actually really a
database; it's a cache, which means that it doesn't _really_ have to deal with
sharding in the way that persistent stores do. And it doesn't really shard at
all; it basically stores a complete copy of the world's social graph in every
region, which it can just populate from that region's MySQL replicas. (It's
amazing the things you can do when you can be eventually consistent.)

(From what I recall, the main social network's MySQL _also_ isn't really
sharded by graph in any fancy way; it's basically "just" hash-sharded by
entity ID.)

~~~
archit3cture
dgraph does try to provide horizontal scaling out of the box. The sharding is
done by predicate; see
[https://docs.dgraph.io/deploy/#multiple-instances](https://docs.dgraph.io/deploy/#multiple-instances)
for documentation. I am not sure how it behaves for very frequent predicates,
though.

------
edem
I would have liked to read more about graph sharding. The article was rather
sparse about it.

