How does database sharding work? (planetscale.com)
343 points by creativedg on April 7, 2023 | 107 comments



> I suppose the more fundamental question is: why are you not using a database that does sharding for you? Over the past few years the so-called “serverless” database has gotten a lot more traction. Starting with the infamous Spanner paper, many have been thinking about how running a distributed system should be native to the database itself, the foremost of which has been CockroachDB. You can even run cloud Spanner on GCP.

This is the most important paragraph in the article. In a world where things like Dynamo/Cassandra or Spanner/Cockroach exist, manually-sharded DB solutions are pretty much entirely obsolete. Spanner exists because Google got so sick of people building and maintaining bespoke solutions for replication and resharding, which would inevitably have their own set of quirks, bugs, consistency gaps, scaling limits, and manual operations required to reshard or rebalance from time to time. When it's part of the database itself, all those problems just... disappear. It's like a single database that just happens to spread itself across multiple machines. It's an actual distributed cloud solution.

My current employer uses sharded and replicated Postgres via RDS. Even basic things like deploying schema changes to every shard are an unbelievable pain in the ass, and changing the number of shards on a major database is a high-touch, multi-day operation. After having worked with Spanner in the past, it's like going back to the stone age, like we're only a step above babysitting individual machines in a closet. Nobody should do this.


All the "auto-sharding" DBs have their own quirks, especially around hot key throughput. You often end up having to add "sharding bits" to the beginning of your keys to get enough throughput. The size of one partition is usually tiny compared to one partition of an RDBMS too. So it ends up being not nearly the panacea that it would seem to be.


Software-side sharding keys seem significantly simpler than managing sharding infrastructure. That's the beauty of memcached, right?


Would love to be able to use Spanner..

What do you suggest one does if one has to run on Azure though?

PS Azure Cosmos is not a real product, it is a beta toy that Microsoft just has made expensive enough that people think it cannot possibly be as bad as it is..


> PS Azure Cosmos is not a real product, it is a beta toy that Microsoft just has made expensive enough that people think it cannot possibly be as bad as it is..

What kind of problems did you run into with Cosmos?


It’s supposed to be MongoDB compatible (if I’m not mistaken).

Last I checked, that “compatibility” ran at least a couple of major versions behind the real MongoDB.

To be fair, the same is probably also true with AWS’s DocumentDB


I think that AWS DocumentDB uses the last MongoDB version that wasn't under their new(ish) SSPL license, and any compatibility beyond that was "clean room" backported. It's possible Azure is in a similar place.


DocumentDB is a thin layer that converts Mongo commands to Postgres; under the hood it uses Aurora Postgres. If you check, the constraints are exactly the same. Even the RDS CLI's list command returns the DocumentDB clusters.

The most obvious indicator is that DocumentDB is single-node write (coming from Aurora Postgres), whereas Mongo really is not.


You could consider one of these three –

1. yugabytedb (postgres compatible)

2. cockroachdb (postgres compatible)

3. tidb (mysql compatible)

You could self-host these on Azure or subscribe to their respective managed service offerings.

Of course, the sticker price for managed will be more expensive than a self-hosted enterprise product, which itself will be more expensive than the self-hosted, community-supported, open-source option (which seems to be a real option with YugabyteDB).


It always triggers me to read "postgres compatible" when they've entirely skipped features such as database triggers

They've just selected a few features from Postgres with which they're compatible.


I agree. I guess what they mean is you can use the existing adapter for your language.

The PostgreSQL organisation should create some trademarked badges to properly label the levels of compatibility other products offer.


YugabyteDB supports triggers. YugabyteDB is not just "postgres compatible", it is Postgres with a distributed layer.


It's also a real option with CockroachDB, unless you need regional sharding, which a lot of people don't actually need anyway.


YugabyteDB tablespaces will get you there inside of their Apache 2 license.


Recently joined an Azure shop, so I'm new to Cosmos, but it's fairly abstract as a product. It seems to be a front for different database types like Mongo or Postgres or SQL Server with some additional features such as subscriptions.

Would be interested to hear your challenges.


If it's one more man-week of pain a year, but the Spanner solution adds an additional cost that is 10 times that, there is an incentive for the employer to keep the painful solution.

That, plus lock-in, the open-source ecosystem, and so on.

It's not a black and white situation.


> In a world where things like Dynamo/Cassandra or Spanner/Cockroach exist, manually-sharded DB solutions are pretty much entirely obsolete.

Not really. Cassandra is a write-optimised, slow, network- and IO-heavy beast that's a pain in the arse to administer. We replaced a 6-node m4.4xlarge cluster with a single db.m4.2xlarge on Postgres.

You need to pick your DB to match your data, not the other way around.


Spanner and Cassandra really shouldn't be in the same sentence. They are optimized for very different use cases. The "obsolete" part of that quote does apply to Spanner, TiDB, and CockroachDB in my experience. I haven't used Yugabyte, but the other sharded databases, including Vitess (TiDB is what Vitess is trying to be), actually make life harder.


You're missing the point. Spanner and Cassandra are for completely different ways of storing/accessing data.

For most people (i.e. 95% of people), after putting in a caching layer, paying for a bigger RDS (or equivalent) DB service is the cheapest and fastest option.


> (TiDB is what Vitess is trying to be)

Not true. Vitess is far more proven.


I'm not sure what "more proven" means exactly. If you mean it is not as well known and used in the US? Yes, that's true, because the core developers are Chinese. If you mean it hasn't been used at the same scale in production? False, just completely false.

At my current job, we have reached a scale and use case that requires either manually sharding the database or using a distributed one: 20 TB including indexes, 10-20 tables with >1B rows, and spikes of up to 10K/s transactions. We put Vitess, TiDB, CockroachDB, and Spanner head-to-head, including running production-scale load tests. Spanner won out because it is far superior to them all for scaling geo-replication and sysadmin. But TiDB was a close second because it just works, scales, and is fast. Vitess, on the other hand, was extremely buggy and has a very long, undocumented list of unsupported SQL features. Yes, Vitess has better press and is more well known, but it is an inferior technology imo. TiDB is already what Vitess claims to be.


20TB of data and 10k/s of transactions is not a large cluster. There are Vitess clusters with petabytes of data and 100s of millions of users doing millions of QPS.

The largest Vitess cluster we know of runs on 10s of thousands of servers. So "False, just completely false" is dramatic.

Slack is powered by Vitess https://slack.engineering/scaling-datastores-at-slack-with-v... Twitter runs Vitess https://blog.twitter.com/engineering/en_us/topics/infrastruc...

I am sorry you had a buggy experience running Vitess yourself, but you're just wrong about the size of these production workloads.


I didn't claim that I am dealing with a gigantic workload (total QPS is a lot higher btw). Rather I claimed we are at the point where a distributed database is a good idea if the business wants to continue to scale without major database trouble. I gave the specs so others can understand the parameters of the evaluation. If Vitess didn't perform well at that scale, I can't imagine the pain at Slack and Twitter scale. I don't envy them at all.


https://twitter.com/stewart/status/1252385972014088194

They are pretty happy!

Different people find different things difficult. Don't feel bad.


I don't. I am very happy to not use Vitess and more than willing to recommend your competitors. Don't feel bad.


> There are Vitess clusters with petabytes of data.

Is it GitHub? Where can I find more details about them? It would be interesting to read more about it.


Most orgs, including mine, also have the experience of a Cassandra cluster simply melting down. That was more common ten years ago, but it is pretty much chaos when it happens.

Of the non-SQL distributed databases, I found FoundationDB to be by far the most robust and zero-overhead.

However, at least through version 6, it had absolutely terrible, service-destroying behavior if any of the nodes became low on storage, dropping transaction throughput through the floor; and since transactions are used for recovery and cluster expansion... Since it does a very nice job of auto-sharding, if you're not paying attention this tends to happen to multiple nodes around the same time.

I wish it would get more attention though. It is a really amazing tool.


How does Cockroach compare in terms of performance to manually sharded databases?

My intuition is that a properly sharded database will perform faster-or-same as a non-sharded one in all scenarios, whereas an automatically-sharded database will actually perform worse until you start reaching critical traffic that a single instance won't handle no matter what.

Am I wrong?


[disclosure, former cockroachdb engineer]

You can get expected "single shard" performance in CockroachDB by manually splitting the shards (called "ranges" in CockroachDB) along the lines of the expected single-shard queries (what you call a "properly sharded database"). This is easy to do with a single SQL command. (This is what we do today; we use CockroachDB for strongly consistent metadata.)

The difference between CockroachDB and a manually sharded database is that when you _do_ have to perform some cross-shard transactions (which you inevitably have to do at some point), in CockroachDB you can execute them (with a reasonable performance penalty) with strong consistency and 2PC between the shards, whereas in your manually sharded database... good luck! Hope you implement 2PC correctly.
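
For anyone curious what that "single SQL command" looks like in practice, here is a minimal sketch of pre-splitting a table along expected single-shard boundaries. The table name, split values, and connection string are hypothetical; CockroachDB speaks the Postgres wire protocol, so an ordinary Postgres driver works:

    # Sketch: pre-splitting a CockroachDB table into ranges along expected
    # single-shard query boundaries. Names and values are hypothetical.
    import psycopg2  # works because CockroachDB speaks the Postgres wire protocol

    conn = psycopg2.connect("postgresql://root@localhost:26257/defaultdb")
    conn.autocommit = True

    with conn.cursor() as cur:
        # Split the 'orders' table at chosen primary-key boundaries so that
        # each expected single-shard query stays within its own range(s).
        cur.execute("ALTER TABLE orders SPLIT AT VALUES (1000), (2000), (3000)")

Cross-shard transactions then still go through the database's own 2PC rather than anything the application has to implement.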


The point about strong consistency and 2PC cross-shard is a good one. Even among other "auto-sharding" distributed relational databases, not all of them provide that.


> My intuition is that a properly sharded database will perform faster-or-same as a non-sharded one in all scenarios.

No, because as soon as you have to JOIN across shards performance plummets. I mean, you can't JOIN across shards so you have to simulate that client-side which is very slow.

The secret to performant manual sharding is to figure out a way to minimize JOINs across shards, and that the ones you have to do are very small. Or to start replicating data which quickly becomes its own nightmare of consistency.
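
To make "simulate that client-side" concrete, here is a rough sketch of the scatter-gather pattern a manually sharded app ends up writing. The shard list, routing rule, and table/column names are all hypothetical:

    # Sketch of a client-side "join" across manually sharded Postgres instances.
    from contextlib import closing
    import psycopg2

    SHARDS = ["host=shard0 dbname=app", "host=shard1 dbname=app", "host=shard2 dbname=app"]

    def shard_for(user_id: int) -> str:
        return SHARDS[user_id % len(SHARDS)]   # naive modulo routing

    def orders_with_user(user_id: int):
        # Step 1: fetch the user row from the shard that owns it.
        with closing(psycopg2.connect(shard_for(user_id))) as conn, conn.cursor() as cur:
            cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
            user = cur.fetchone()

        # Step 2: scatter the order lookup to every shard (orders may not be
        # colocated with the user), then gather the results.
        orders = []
        for dsn in SHARDS:
            with closing(psycopg2.connect(dsn)) as conn, conn.cursor() as cur:
                cur.execute("SELECT id, total FROM orders WHERE user_id = %s", (user_id,))
                orders.extend(cur.fetchall())

        # The "join" happens in application code: N round trips and no
        # transactional consistency between the shards.
        return user, orders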


You can join like that with YugabyteDB. But it would be good to get intimate with explain analyze.


Where the line gets blurry is that there can be layers of sharding inside the database even if it is never exposed to the end user. At the limit, disk pages are essentially a shard and treated similarly. Some database kernels do true sharding and resharding within a single server transparently for performance reasons, it isn't just for having data on multiple servers or cores.

There is no intrinsic performance penalty for automatic sharding, it can be as fast or faster than any other scheme. The "faster" cases are when the sharding structure must be adaptive to have uniform distribution of data and load, such as when the key distribution is intrinsically unpredictable or when you are doing multi-attribute sharding. In these cases, effective sharding can only be created by software at runtime.


> My intuition is that a properly sharded database will perform faster-or-same as a non-sharded one in all scenarios.

This is workload-dependent: if your queries only go to one shard, this is likely true. But if you have cross-shard queries, then it's no longer true.


> In a world where things like Dynamo/Cassandra or Spanner/Cockroach exist, manually-sharded DB solutions are pretty much entirely obsolete.

Far from it. Plenty of companies that are now running MySQL or Postgres will shard manually when they need to scale.


It can still be an obsolete practice even if some continue to choose to do it.


distributed decentralized databases are their own special nightmare. they are all full of hidden traps, footguns, pains, limits. the more you push them the more these emerge.

in some ways the stone age was terrible, and in some ways the stone age was far superior to the current age. we've gotten soft, lazy, stupid, comfortable. we used to be feral animals, but now we're more like house plants. if we got more used to doing things like dumb animals, life would be simpler, and we could appreciate what we have more. but it's hard to resist the temptation of feeling like what we have is never enough. the grass is always greener, and how terrible that it isn't ours.


I just have to say - thanks for your response.

I am a full stack dev but I end up doing a lot of front end and rarely have to touch SQL. Posts like yours help me to realize there's still a lot I do not know while still being informative. So thanks <3


Solutions like Spanner are fantastic for some things (and within Google it's a no-brainer, since you're not paying an outrageous markup), but besides being expensive, they don't just let you drop them in as a scalable replacement for an existing usage pattern. Sharding usually first comes up when you're at a huge scale and already have a ton of app code and database design but have grown past what a single db can support (that's a massive amount of traffic these days, well over 1M DAU for most apps). See for instance https://cloud.google.com/spanner/docs/schema-design; a lot of what you have to plan for is very different from designing a normal (e.g.) Rails app with typical patterns.

It's a great option, as long as you're aware of the tradeoffs and don't expect it to act (and cost) exactly like a single Postgres server.

My order of ops generally goes:

- Can you get away with pure (or mostly) key/value for your hottest table(s) and mostly avoid queries? Use Dynamo or Google Cloud Datastore, but deeply understand their quirks (e.g. never index a timestamp field on Datastore or you're worse off than a weak-ass SQL server for write-heavy loads). These can scale basically forever with zero effort if you fit their constraints, and are cheap enough

- Can you tolerate the price of Spanner and deal with it not being normal SQL? Go for it, expect some non-trivial migration cost

- If you have to shard/cluster yourself, can you make Redis work? It'll be easier operationally than managing N SQL DBs of any kind. I know, if you could you'd probably have been fine using Dynamo...

- Can you soft-shard, e.g. use different machines for different tables and spread load enough to get each table onto a single server? Do it.

- Can you minimize re-sharding ops? (If you're in a massive growth phase, the answer is no) Ok, fine, do SQL sharding, and make sure you have well-tested code to do it and you mostly use key/value access patterns.

- Consider a beefy AF bare metal SQL server, see how far that gets you.

- If none of this applies, re-evaluate the cost ($ and time) of Spanner, you're paying a big price one way or another

- Only bad ideas from this point on, just do sharding but it'll suck...

Nowhere on my list: running your own cluster for Cassandra, Mongo, or anything similar, it's such an ops nightmare and there are hosted services that are better in every way but cost.

If you do end up writing your own sharding layer, best of luck, I have yet to see any reasonably production-ready ones that work well with pretty much any web framework or ORM. And writing these is so tough, so many edge cases to consider...not something I hope to ever do again.


> Consider a beefy AF bare metal SQL server, see how far that gets you.

Yep, this is of course the default - although you'll at least want replication for availability and disaster recovery, but that's not too bad with PG/RDS.

> they don't just let you drop them in as a scalable replacement for an existing usage pattern, and usually sharding starts first coming up when you're at a huge scale and already have a ton of app code and database design but have grown past what a single db can support

I think once you get beyond what fits on one primary box, you're almost certainly at the point of rethinking your data access models anyway. Manual sharding is disruptive too.


> Consider a beefy AF bare metal SQL server, see how far that gets you.

Quite far, possibly to millions of users.

Case in point: StackOverflow runs on SQL Server.


Actually it can vary from "not that far" to "all the way" (StackOverflow). It depends very much on the schema and the type of workload. But in general people really underestimate how far an RDBMS can take you.


Yes and no - I think people also underestimate how miserable running a single giant RDBMS can be. Everything gets hard and dangerous - backup/restores, upgrades, online migrations & schema changes. Adding the wrong index can easily bring the whole thing down. They are complicated beasts that get very fragile at the large end (size+throughput).


Speaking of backups, how do you get a consistent backup of a sharded DB? Not everything is sharded; some data is replicated and should be consistent.

Everything becomes fragile at the large end.


Well, I'm comparing it to a natively distributed database, e.g. DynamoDB. I've worked with truly massive DynamoDB "instances" and you really don't have to do much; they aren't fragile, they just work. I don't necessarily know how their backup system works under the hood, but it backs up to a consistent point in time.


"millions of users" are vastly different scales with different use cases. StackOverflow is mostly serving cached content.


Because distributing a database is sensitive to CAP semantics (AP or CP) and data dependencies (graph partitioning is hard *), and storage engine choices are driven by use-cases, the general technical solution is thus highly complex**. Spanner uses atomic clocks, for example. Running CockroachDB yourself is [very likely] not the same thing as using a SaaS variant, either. Sight unseen, it cannot be 'trivial'. Same for Spanner. The general solution seems to require paying someone to provide the service. In sum, it is not a clear-cut yes/no situation.

btw, iirc [distributed] Postgres was never as stellar as the single-node Postgres we sing praises of. I'm sure it has improved significantly.

> "manual operations required to reshard or rebalance from time to time. When it's part of the database itself, all those problems just... disappear."

Not really correct.

* "Choosing the right keys can help Spanner evenly distribute data and processing to avoid hotspots"

https://cloud.google.com/spanner/docs/schema-design

https://cloud.google.com/blog/topics/developers-practitioner...

** https://static.googleusercontent.com/media/research.google.c...

[Spanner certainly did -not- start off as a distributed RDBMS. Because that project would have never been given a green light. Because it is understood just how complex that system would need to be. It started off as a distributed k/v. That's it.]

"[I]n many ways today’s Spanner is very different from what was described [in original Spanner whitepaper]"

...

"The initial focus of Spanner was on scalability and fault-tolerance, much as was the case for many systems at Google. In the last few years Spanner has increasingly become a fully-fledged database system. Production deployment with internal customers taught us a great deal about the requirements of web scale database applications, and shaped the techniques we presented in this paper. Aggressive focus on horizontal scalability enabled widespread deployment without investing too heavily in single machine performance. The latter is one of the areas where we are making continuous improvements, such as upgrading lower level storage to Ressi."

"The original API of Spanner provided NoSQL methods for point lookups and range scans of individual and interleaved tables. While NoSQL methods provided a simple path to launching Spanner, and continue to be useful in simple retrieval scenarios, SQL has provided significant additional value in expressing more complex data access patterns and pushing computation to the data."


A favorite resource: https://learn.microsoft.com/en-us/azure/architecture/pattern...

Microsoft's Azure Cloud Patterns is some of the best documentation out there. It's not Azure-centric. It focuses on why you may want to do something and describes the techniques commonly used to solve it. It's pretty great.


I also couldn't recommend this more highly. If you are designing a new application, read these docs. You will almost certainly learn something deeply useful that you'll carry with you for the rest of your career.


Great resource and very well written articles. I'm always curious how these come to be when done really well. I wish we could generate high quality, useful, correct, internal documentation like this at my company, but I've never figured out how to get that done in practice. The people that know the concepts are not natural/gifted writers, or at least wouldn't be motivated or prioritize doing this. Even if they did, the style/tone/depth would vary greatly. For these msft docs, they somehow achieve consistent tone, depth, and quality/correctness across so many domains and I just don't know how that's achieved.


The main issue is, I can't stand C# or OOP syntax for illustrating patterns. Why a class here? For God's sake, please use simple functions to make the point.


I think it's only reasonable, then, for you to show us how to do it in COBOL, Lisp, or some other 1950s programming language.


They might supply Python examples if you ask. A lot of Microsoft documentation has Python as well as C# these days.


Just want to say thanks for pointing out this resource. Will make for some great long weekend reading!


Sometimes I wonder how many other industries allow a worker to stumble upon a document that they then read outside of work, despite the details often being kind of difficult to understand at first, frustrating even, but we do it literally as a pastime, for pleasure. I love that.


Never come across this before. The collection of articles one level up from your link is an incredible resource.


MSDN is like the Wikipedia of coding problems. You find documentation there for things that have nothing to do with MS.


Super helpful, thanks!


One thing I notice is that overuse of sharding, especially hash-based, might turn your relational database into just another key-value store, with consistency constraints moving into application code, and you lose many of the advantages of a traditional RDBMS.


Online use cases always(?) scale out to key-value stores. Offline use cases almost always scale to distributed, date-partitioned columnar stores.


What's the alternative as things get too big? Sharding by date? By client? How do you prevent hotspots?


the only real answer here is to shard by customer

optimistically, you can try to shard by read use case, but that's never gonna be stable over time

if you need a true multi-tenant system you can only shard by individual entity and move all of the composition logic to the next layer up, there's no way to cheat
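
a minimal sketch of what shard-by-customer routing can look like: a small directory maps each tenant to the shard that owns all of its rows. the names and the placement policy here are hypothetical, not anyone's specific implementation:

    # Sketch of shard-by-customer routing: every table carries a tenant_id and a
    # small directory maps each tenant to the shard that owns all of its rows.
    TENANT_DIRECTORY = {                 # in practice a small, strongly consistent store
        "acme":   "postgresql://shard-03/app",
        "globex": "postgresql://shard-07/app",
    }

    def dsn_for_tenant(tenant_id: str) -> str:
        # Every query for a tenant hits exactly one shard, so joins and
        # transactions within a tenant never cross shard boundaries.
        return TENANT_DIRECTORY[tenant_id]

    def place_new_tenant(tenant_id: str, shard_loads: dict[str, int]) -> str:
        # New tenants go to the least-loaded shard; an outsized tenant can later
        # be moved to a dedicated shard just by updating its directory entry.
        dsn = min(shard_loads, key=shard_loads.get)
        TENANT_DIRECTORY[tenant_id] = dsn
        return dsn

the cost, as noted elsewhere in the thread, is uneven shards; the benefit is that per-tenant queries stay single-shard.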


You’re exactly right. I work for a large cloud database service and the vast majority of our top customers shard by customer. This also gives you the benefit of using higher levels of sharping abstractions that map to performance SKUs for more demanding customers, allowing much more efficient allocation of COGs.


> the only real answer here is to shard by customer

No.

Pick a stable, guaranteed-to-exist, shard key (composite or atomic properties) and use that. If composite, order properties used in most-to-least distinct value distribution.

> if you need a true multi-tenant system you can only shard by individual entity and move all of the composition logic to the next layer up, there's no way to cheat

This is incorrect. Sharding and multi-tenancy are orthogonal concepts.


> No. Pick a stable, guaranteed-to-exist, shard key (composite or atomic properties) and use that. If composite, order properties used in most-to-least distinct value distribution.

what??

if you shard users by user ID and orders by order ID, then a query that joins a bunch of user orders in the same tenant namespace will spread across multiple user shards and multiple order shards

value distribution doesn't really have any impact here

(shard keys are also guaranteed to exist by definition, not clear what you mean by that)

if you don't care that specific queries cross sharding boundaries, okay, then no problem, but in that case sharding is not solving the problem that we are talking about here

> Sharding and multi-tenancy are orthogonal concepts.

sharding and multi-tenancy are only orthogonal if you don't care that a single tenant can have information on multiple shards


>> No. Pick a stable, guaranteed-to-exist, shard key (composite or atomic properties) and use that. If composite, order properties used in most-to-least distinct value distribution.

> what??

> if you shard users by user ID and orders by order ID, then a query that joins a bunch of user orders in the same tenant namespace will spread across multiple user shards and multiple order shards

Note my recommendation of picking a "stable, guaranteed-to-exist, shard key."

If there is a users table/collection sharded by its id and an orders table/collection sharded by its id, then there is no "guaranteed-to-exist shard key" between them, right? So, in that case, where the two are often accessed together, having a "guaranteed-to-exist shard key" of the "tenant namespace" would be the logical choice.

> value distribution doesn't really have any impact here

I mentioned value distribution strictly in the context of ordering composite shard keys (if applicable). My apologies for any confusion this might have introduced.

> (shard keys are also guaranteed to exist by definition, not clear what you mean by that)

My implication was in reference to what is always available across accessing sharded entities. In the scenario you describe, sharding by either user or order id would not be ideal. In situations where one or the other is not sharded, then identifying a common shard key likely is not needed.

>> Sharding and multi-tenancy are orthogonal concepts.

> sharding and multi-tenancy are only orthogonal if you don't care that a single tenant can have information on multiple shards

My assertion was regarding theory, not specific scenarios. In practice, having multiple tables/collections needing shards with the tenant being the common entity strongly implies sharding on a tenant property and not on those unique to each table/collection, as implied in your reply.


>Pick a stable, guaranteed-to-exist, shard key (composite or atomic properties) and use that.

This is a pretty risky approach since it's almost certainly the case that you won't end up evenly distributing your data across shards using this method.


> This is a pretty risky approach since it's almost certainly the case that you won't end up evenly distributing your data across shards using this method.

Distributing data across shards is a function of the properties selected to use for partitioning. So I do not understand how "a stable, guaranteed-to-exist, shard key (composite or atomic properties)" is "a pretty risky approach."


more recent customers/users/accounts probably do more actions than very old accounts though; how is that not also eventually creating uneven shards?


While high volume multi-tenant "customers/users/accounts" systems are common, they are not the only ones which benefit from sharded persistent stores.

For example, consider a system which monitors farm equipment for Caterpillar and John Deere. Let's say each company has 100k devices which send one message per day to the system.

While it is easy to envision sharding device messages based on "DeviceId / Company" in this hypothetical system, there would be no value in sharding by the two customers.


you're right, uneven shards are an inevitable outcome of this approach

but shard "even-ness" is in direct tension with the concern of the GP, which is execution atomicity

frequently, it's better to have uneven shards (that you can e.g. scale independently when necessary) that give you atomic execution, than even shards that require distributed transactions


Past a certain point, you ought to think about whether to keep using an RDBMS as a K-V store or switch to a real distributed K-V store like Cassandra, ScyllaDB, DynamoDB and the like.

As for hot spots, they have always been an issue with K-V stores, and the only real solution is good key design, though there are some tricks:

* Use a uniformly distributed but deterministic key prefix. For example, instead of using the raw user_id as the key, attach a small hash before it: (hash(<user_id>), <user_id>). This can help with load distribution if your <user_id> is not uniformly distributed by itself, such as a phone or ID number.

* Add more data to your key to increase cardinality. For example, with time-series data, instead of using object_id alone as the partition key, use (object_id, time_bucket) so the data for a busy object gets split into different partitions over time.
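
Both tricks as a rough sketch; the key layouts, hash choice, and bucket size below are arbitrary illustrative choices, not any particular store's convention:

    # Trick 1: a deterministic hash prefix ("salt") spreads skewed IDs, e.g. phone
    # numbers, evenly across partitions while staying derivable from the ID.
    # Trick 2: a time bucket in the partition key spreads one busy object's
    # time-series rows over many partitions as time passes.
    import hashlib
    from datetime import datetime, timezone

    def salted_key(user_id: str, buckets: int = 16) -> tuple[int, str]:
        salt = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % buckets
        return (salt, user_id)

    def time_bucketed_key(object_id: str, ts: datetime) -> tuple[str, str]:
        bucket = ts.strftime("%Y-%m-%d")   # one partition per object per day
        return (object_id, bucket)

    print(salted_key("+1-555-0100"))
    print(time_bucketed_key("device-42", datetime.now(timezone.utc)))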


Or just change to user_uuid (or a similar non-int PK).


Sure, unless your sharding/partitioning keys need to be derivable from the business keys.


Distributed databases like Yugabyte, Cockroach, TiDB, etc.


Sorry if I was unclear. Hoping to hear what strategies were practical within a given DB. A whole different platform is usually a much larger undertaking.


Pick a stable, guaranteed-to-exist, shard key (composite or atomic properties) and use that. If composite, order properties used in most-to-least distinct value distribution.


Not so easy at scale, because there's an additional requirement for it to also not be susceptible to hotspotting.


Yes, this is where caching layers come in. Now your cache acts as the RDBMS.


The first link in the article [1], to a blog post about the etymology of the sense of "shard" that we now think of as the word's primary meaning, is super interesting.

The release of Ultima Online doesn't really seem too close to any inflection point of the word on Google Ngram, but I'm not sure exactly how close we should expect it to be to 1980 or 2000.

1. https://www.raphkoster.com/2009/01/08/database-sharding-came...


Shards are the secret ingredient in the webscale sauce. They just work.



Thanks for this article, PlanetScale. I've been enamoured with sharding recently, but more for multithreaded performance.

I want multimaster postgres.

I started trying to write a Postgres synchronizer by sorting all the data by row and column, hashing the data of every column and row, then doing a rolling hash of the data.

In theory, two databases can synchronize by sending the final hash of their data and then doing a binary search backwards until the hashes match. This way you can work out which parts of the databases differ and need to be transmitted. If the databases are identical, very little data gets transferred.
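
A rough sketch of that hash-compare-and-recurse idea (essentially a Merkle-style diff; the row format, hashing, and the assumption that both sides hold the same number of rows in the same key order are simplifications for illustration):

    # Sketch of hash-based diffing between two ordered row sets: hash a range of
    # rows, compare the hashes, and recurse only into the halves that differ.
    import hashlib

    def range_hash(rows: list[tuple]) -> str:
        h = hashlib.sha256()
        for row in rows:
            h.update(repr(row).encode())
        return h.hexdigest()

    def diff_ranges(a: list[tuple], b: list[tuple], lo: int, hi: int, out: list):
        # Assumes both sides have the same number of rows in the same key order;
        # differing lengths would need an extra alignment step.
        if range_hash(a[lo:hi]) == range_hash(b[lo:hi]):
            return                                   # identical range: nothing to send
        if hi - lo <= 1:
            out.append((lo, a[lo:hi], b[lo:hi]))     # narrowed down to one differing row
            return
        mid = (lo + hi) // 2
        diff_ranges(a, b, lo, mid, out)
        diff_ranges(a, b, mid, hi, out)

    rows_a = [(1, "alice"), (2, "bob"),   (3, "carol")]
    rows_b = [(1, "alice"), (2, "bobby"), (3, "carol")]
    diffs = []
    diff_ranges(rows_a, rows_b, 0, len(rows_a), diffs)
    print(diffs)   # -> [(1, [(2, 'bob')], [(2, 'bobby')])]

The diffing itself says nothing about which side "wins" for a differing row; that is the conflict-resolution problem described next.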

One problem I've not worked out is how to decide which database is the winning database without having to change application queries.

If you synchronize two multimaster Postgres databases that have had independent writes to different sections, how do you identify which database is the source of truth for a column/row combination?


> One problem I've not worked out is how to decide which database is the winning database without having to change application queries.

I believe this is a case of the "Two Generals' Problem"[0], which implies that there is no provably correct solution to achieve this.

> If you synchronize two multimaster Postgres databases that have had independent writes to different sections, how do you identify which database is the source of truth for a column/row combination?

You can't without a quorum[1] and even that does not guarantee success.

0 - https://en.wikipedia.org/wiki/Two_Generals%27_Problem

1 - https://en.wikipedia.org/wiki/Split-brain_(computing)


> You can't without a quorum[1] and even that does not guarantee success.

Isn’t split-brain largely resolved by the rules:

1. Must have a strict majority of the total group agree to commit

2. Total should be an odd number


Only if there can only be one simultaneous split.


If all nodes lose traffic to each other (which is plausible with bad config changes), good luck!


"a binary search backwards" is not well defined and doesn't guarantee any upper bound on consistency

sharding data based on individual rows within tables is tricky, you won't get reliable guarantees for queries in this way

"which database is the winning database" is a function of individual rows, a query that reads data from N different row-owners needs to query N different instances, or else accept that it will work against stale data


Spock may be of interest to you: https://pgedge.github.io/spock/


You need Citus


Citus follows a single-master architecture for replication, just like standard Postgres. Citus would be a good solution for adding additional writable nodes by sharding tables across multiple databases.
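
For context, distributing a table in Citus looks roughly like the following. The table, column, and connection details are hypothetical; create_distributed_table is the Citus function that hash-shards a table across the worker nodes:

    # Sketch: hash-sharding a table across Citus worker nodes. The coordinator
    # speaks normal Postgres; names and connection string are hypothetical.
    from contextlib import closing
    import psycopg2

    with closing(psycopg2.connect("postgresql://citus@coordinator:5432/app")) as conn:
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute("CREATE TABLE IF NOT EXISTS events (tenant_id bigint, payload jsonb)")
            # Shard by tenant_id so each tenant's rows (and joins on tenant_id)
            # stay on a single worker node.
            cur.execute("SELECT create_distributed_table('events', 'tenant_id')")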

Recently they've been acquired by Microsoft and have started to be integrated into their CosmosDB offering. Anybody have good experience with this DBaaS offering?


It does, but if you’re sharding do you still need multi master?


And the best sharding? Doing multi-tenant.

Sharding at the table level is very complex and expensive. Giving each tenant its own DB is very practical, and the cases that push you toward fancy sharding (like reporting) are where you do the other fancy things (like shipping events into Kafka, etc., etc.).

Also, most scalability issues are dominated by a few tenants that consume most of the resources, and distributing the load is easier per-tenant.


I think we need intuitive tooling to let developers continue just thinking in multi-tenant terms, with single-tenant behind the scenes.


I've been doing that as much as possible. However, you're still left with availability issues, such as replication / failover management etc., which really are orthogonal to tenancy.

How do you manage that?


You need at least 1 backup and/or replica.

Availability is kinda good: you could lose one or a few tenants, but that doesn't take down all the rest (if they are isolated properly).

It's MUCH worse if the whole DB built the fancy way goes down, or worse, you get a cascade of latency and/or crashes from the interlinked nature of the "scalable architecture" that is in fact a share-everything global singleton.

And you can get a bit fancy (I haven't done it myself; my customers tolerate a bit of downtime) and fork the tenant and upload it to another node.


The issue is when your primary tenant(s) are much larger than the others... eventually vertical scaling is difficult even for only one tenant, depending on its size.


I have been doing that in a previous project. One DB or one Schema per tenant can also simplify backup/restore, rollback, export, and data boundaries.


I haven't heard about the DB sharding technique since before 2010. The technique was used to separate, in a given table, the most heavily used fields from those less used, by using a different server for each shard. With the rise of in-memory databases like Redis, sharding was abandoned.


> For Amazon, that means the orders table and the products table containing the products in the orders table need to be physically colocated.

Isn't this an oversimplification? The product number can stay the same but refer to a different thing over time. Product 12345 can be a book on January 1st, 2023, and a comic book on March 1st, 2023. The idea is that product versions can change over time. (Which is mostly abused by scammers who farm fake reviews and bait-and-switch products.)

So probably they have some product history table, or save the product information in the order.

Also, sharding in a big data environment is when your data does not fit into Excel anymore, so you have 20 Access files (the 1 GB database limit in Access).


Does anyone else use PlanetScale and find it extremely slow? I'm using the PlanetScale trial; simple queries can easily take 3-4s "cold", and then seem to get a little faster once, presumably, I hit the cache.


Hey, I work for PlanetScale.

Definitely not normal. It’s hard to know why you’re seeing slow queries without more information.

The most common causes are either: missing indexes or network latency between the app and database (are they in the same region?).

We don’t have cold starts, but it is possible for queries to get faster once data is moved into memory. 3-4s is very slow though, I suspect it’s doing a full table scan and an index will solve it.

If you check Insights you can get more info, would also help to run an explain on the query (https://planetscale.com/courses/mysql-for-developers/queries...) to see what’s happening.

Also, if you email support, they’ll help debug it for you. Hope that helps!


Is the hash-based approach scalable? A list operation would require contacting every single shard, which seems outrageously expensive. It seems like you want some locality, but I'm not aware of locality-preserving hash functions (which seems like a contradiction, but maybe people here have encountered one?).


It works because most operations don't need to "list all of everything"


Aren’t most SQL queries a table scan at some point? I guess you’d shard the index on range and the actual data can be hash sharded, but I don’t know if that buys you much since now the index and data are on separate machines. + you probably need to completely disallow queries on unindexed tables (ie you’re ClickHouse not spanner). That being said there is also interesting work done in the auto-indexing field that might provide a way out of the problem (ie generating an index transparently for the range seeing your hottest queries) but I think you’re still left with the amplification problem of the machines that need to get hit to access the underlying value.

Also, I’m not saying “list all”. I’m saying even list 10 or list 1000 is the same problem - you still have to contact all servers in your cluster to do a map/reduce to get the result. Sure, list operations may be less common but their cost seems exponentially more expensive.


If you're using Mongo, the answer is "only with an Enterprise support contract and Ops Manager" :)


Are you going to make the joke or...



