Lessons learned from sharding Postgres at Notion (notion.so)
471 points by craigkerstiens on Oct 6, 2021 | 121 comments



At Kami (kamiapp.com) we had to make a similar transition last year - we'd seen usage grow over 20x in a 6-month period on our classroom product due to the pandemic, and we were storing all our content in a single-primary Postgres database. We had plenty of headroom before, but we were suddenly running out of write capacity, even after scaling vertically up to ~150 vCPUs. Our workload is super write-heavy so scaling out replicas wouldn't have helped much at all, and I was spending a lot of time manually vacuuming during our daily off-peak window as we were burning transaction IDs at an astonishing rate, ~800 million per day.

I looked at a bunch of options, but ultimately managing a sharded database seemed like too much overhead for our small (but expanding!) dev team - so we decided to move our heavily loaded tables to CockroachDB instead. Since it's mostly Postgres-compatible the transition would be easier, and it would be much simpler to operate as it automatically balances load and heals itself. We're running Cockroach on GKE and it's a really nice fit for running on Kubernetes.

Ended up working well for us - we still have our smaller tables on PG but we want to move it all to Cockroach over time. There were some teething issues as we learnt what to monitor on it (read amplification is the big one here, though Pebble has helped a lot vs the previous RocksDB), and it does need a lot of hardware at high scale (we're running fifty 16-vCPU nodes for it), but overall I'm happy with it.


How have you found moving from a default transaction isolation level of read committed on Postgres to serializable-only on CockroachDB? I know of some somewhat-standard usage patterns that lean heavily on more lax isolation levels for acceptable performance (and may just deadlock otherwise).

I'm also curious how you have found dealing with legacy schemas and queries that might cause a lot of data shuffling due to poor data locality. Was it necessary to put a lot of time into the sharding strategy? Have you experienced new query hotspots?


The transaction isolation level hasn't been an issue for us - our read/write patterns aren't super 'transactional', which may be why: we don't get a lot of conflicting updates, since usually only one user has permission to edit the content on a given document. Arguably these tables could even be on Scylla/Cassandra with eventual consistency, though that would have been a harder transition and we'd have had to spend more time making sure our application could handle it correctly.

We did have to make some schema changes in order to improve the data locality and avoid hotspots - CockroachDB did _work_ without doing those, but the changes improved performance massively. It wasn't super time consuming though, and the admin UI/monitoring in cockroach is nice for showing you any hot ranges.

The biggest one was our table which stores annotations on a document - on PG its primary key was only (annotation_id UUID), and we had a secondary index on (document_id). This meant that on Cockroach the data for a document was spread across many nodes, and querying all the annotations for a document would take ~40ms. We changed the primary key to (document_id, annotation_id), which co-located the data, and that came down to ~2ms. Composite primary keys aren't ideal for ActiveRecord but the performance win was worth it.
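Roughly, in DDL terms (table and column names here are illustrative, not our actual schema):

    -- before: PRIMARY KEY (annotation_id) scattered a document's rows
    -- across the cluster; after: document_id leads the key, so a
    -- document's annotations sit in one contiguous key range
    CREATE TABLE annotations (
        document_id   UUID NOT NULL,
        annotation_id UUID NOT NULL DEFAULT gen_random_uuid(),
        body          JSONB,
        PRIMARY KEY (document_id, annotation_id)
    );

    -- the hot query now reads a single contiguous range:
    SELECT * FROM annotations WHERE document_id = $1;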

We also had some tables with hotspots, which was mostly just a matter of making sure they used UUID keys. It would be a problem though if we had a table with individual rows that were super hot, or where the table didn't have enough data to get split into multiple ranges - though Cockroach does have load-based splitting to help here. It was only really necessary to worry about the higher-load tables; if a table is below ~2000 qps or so it won't matter.


> Arguably these tables could be on Scylla/Cassandra with eventual consistency even

In a past company we managed to run a bank on Cassandra with eventual consistency. It was ... well, it was good CV experience for the DevOps guys. (I also didn't need much encouragement to take the 'no getting drunk while on-call' rule very seriously.)


You might consider switching UUID to a more sequential id, like ULID, to further improve index data locality.


I think that might be counterproductive on Cockroach - the ideal key has locality for data accessed together, but also has writes generally spread across the key range. If all the new rows have a similar key prefix then the shard covering the end of the key range will be hot, which limits scalability.

This is a difference from PG, where having writes concentrated in one part of the key range is helpful, as the B-tree nodes covering it are more likely to stay cached.


Does Cockroach really use the key prefix, vs a hash of the key? I would have thought it would be important to avoid unexpected performance hits resulting from the key generation approach.


It is easy enough--arguably easier, right?--to generate random keys, but if you always work with hashes of the key (which I am not claiming Cockroach doesn't do, only that I would be frustrated if it does) then you lose the ability to do range queries.


Cockroach lets you hash-shard your keys n ways if you choose to, through a feature called hash-sharded indexes (https://www.cockroachlabs.com/docs/stable/hash-sharded-index...). That's a trade-off: costlier range scans (a range scan turns into n scans) in exchange for avoiding a throughput-limiting hot spot.
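For reference, the rough shape of that (hypothetical table; the exact syntax has shifted a bit between CockroachDB versions, and older releases gate it behind a session setting):

    -- on older versions: SET experimental_enable_hash_sharded_indexes = true;
    CREATE TABLE events (
        ts      TIMESTAMP NOT NULL,
        id      UUID      NOT NULL DEFAULT gen_random_uuid(),
        payload JSONB,
        -- spreads sequential timestamps across 8 buckets instead of one hot range
        PRIMARY KEY (ts, id) USING HASH WITH BUCKET_COUNT = 8
    );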


This is a great read and primer for anyone looking to shard who knows they need to scale. Having helped a ton of people shard their data, my advice still remains: don't do it until you know you have to. The old Citus Data blog is a great resource for some guidance around best practices and how-tos, even if you're not using Citus.

That said, some of the things you can do from day 1 make things way easier when the time comes later. Workspace ID/tenant ID/customer ID: having your data grouped that way makes it very shardable and saves you a lot of work later on, compared to the case where you didn't have that in your data model.

I'm not sure I buy that Citus/Vitess are magic; both are reasonably clear about how they work and you can dig into it. At the same time, I'd weigh the downside of Citus (can't speak to Vitess) that the online rebalancer isn't open source, so at that point it's a proprietary product.


Vitess is pretty close to magic for a lot of use cases. The downside of its rebalancer is that the re-sharding process requires a full expansion of the shards being re-balanced, which is to say: if you have two shards [-80] and [80-], and you want to split out part of the first shard, you need to provision two new shards [-40] and [40-80], let the data copy into them, and then you can mark them as active and delete the old [-80] shard. In this brave new cloud world, that's not as much of an issue as it would have been in the old days of running your own datacenter.

The big problem everyone I have talked to about sharding runs into is managing the shards as you expand. In this case it looks like Notion over-sharded so they can spin out up to 480 physical nodes, but when they need the 481st it is going to be a nightmare. That's what Vitess gets you for free: expand to any number of shards and just never worry about it again.


Resharding in Vitess is not quite as magical as this post implies. My infrastructure team has tried resharding the cluster powering my app several times and each time it's failed.

It's been a while since the last attempt so I forget the details but I think the reshard replication lag was growing no matter what resources we threw at it. We've since scaled vertically instead and are working on architecture changes to reduce our MySQL load (which admittedly is extremely high).


I would encourage you to join the Vitess Slack channel and let them know of any issues you run into there. Resharding has successfully been done at massive scale, so it's unlikely that the system can't handle the load. Here is the blog post from when GitHub re-sharded live under full GitHub load: https://github.blog/2021-09-27-partitioning-githubs-relation...


Thanks, I'll pass that along to my infrastructure team. Glancing through the blog post, it's quite light on details, and I wonder if we just have a very different data load pattern.

Our dataset size is actually reasonably small (10s of TB), but our transaction throughput is very high. Glancing at our dashboard, baseline is 250k/s, with sustained daytime load in the 500k/s region.


Agreed. Just to add to this we’ve seen reshards on live datasets that are over 10 petabytes in size.


don't do it until you know you have to

If you wait until you have to, then isn't it too late to plan and test your solution? So you suffer from degraded performance and/or outages while you rush out a solution.


They said wait until _you know you have to_, not wait until _you have to_. If you're monitoring your systems appropriately, you should know well before you have to. (modulo inorganic growth explosions, which are hard to predict.)

This is just a specific case of premature optimization.


Agreed - at a previous job someone tried to force me into implementing some horrible sharding solution they'd dreamt up. I didn't: three years later the project was nearly dead and the database had never gotten anywhere close to maxing out a single machine.


(Ozgun, Citus co-founder)

That's a good point. Sharding has an interesting trade-off. Early on, you don't need it and there's an overhead to distributing work across multiple machines (you're taking an additional network hop). Later when/if you need it, sharding becomes painful to introduce. You may need to change your data model for performance and move data.

For years, I also cautioned against introducing sharding as a premature optimization. Recently, we changed our approach and also introduced sharding on a single VM. I think that offers a pretty good trade-off. If you're interested in this topic, I’d be curious to hear what you think.

https://www.citusdata.com/blog/2021/03/20/sharding-postgres-...


Yeah, Notion waited too long - their performance was atrocious for a long time.


I'm not sure what you meant by "online rebalancer" above, but the shard rebalancer in Citus has been open source since March 2021: https://www.citusdata.com/blog/2021/03/13/scaling-out-postgr....


Ex-Citus here. The open-sourced shard rebalancer blocks writes to the shard being moved. The online rebalancer (closed source) uses logical replication and doesn't block writes to shards being moved, except for a brief period. Everything is the same except how shard moves are implemented.


Yeah, echoing here to be extra clear. What is open-sourced holds a write lock while rebalancing. So it rebalances, sure, but it's only marginally better than a dump/restore. You can still read, yes, but I'm not sure many applications can be okay with no writes flowing to a table for hours while rebalancing is happening.


Got it, thank you! I'm just wondering whether this limitation can be alleviated / worked around by combining sharding and replication. In that case, I would expect the primary DB cluster to hold the write lock during shard rebalancing while allowing writes to replicas (once the primary cluster's rebalancing finishes, the roles would reverse and rebalancing would be applied to the replicas, while the primary is already free of the write lock).


Understood. Thank you for the clarification.


>>” Workspace ID/Tenant ID/Customer ID, having your data grouped that way makes it very shardable”

Dumb question: if someone starts off by grouping their data only by customer ID and then later needs to shard, couldn't all of the sharding problems go away if they simply created a new Customer_Tenant table to map customer_id to tenant_group_id?


Sorry, wasn't clear enough: tenant ID = customer ID = workspace ID. There is often a benefit, though, to materializing that ID onto all child tables. If you take a salesforce.com example, every customer has an "orgID"; from there they have accounts, and accounts have opportunities. It's not common for people to put the "orgID" on the opportunity from a data modeling perspective, but when it comes to sharding it gives you a big win.

If you have that tenant discriminator on all tables then it's easy to route a query to the appropriate physical and logical shard right away, vs. having to do some DB join first in your request to figure out where it goes.
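Concretely, something like this toy schema (made-up names):

    -- the tenant key is materialized onto every child table, even where
    -- it's "redundant" from a pure normalization standpoint
    CREATE TABLE accounts (
        org_id     BIGINT NOT NULL,
        account_id BIGINT NOT NULL,
        PRIMARY KEY (org_id, account_id)
    );

    CREATE TABLE opportunities (
        org_id         BIGINT NOT NULL,  -- denormalized routing/shard key
        account_id     BIGINT NOT NULL,
        opportunity_id BIGINT NOT NULL,
        PRIMARY KEY (org_id, opportunity_id)
    );

    -- every query carries org_id, so the router can pick the shard
    -- without a lookup or join first:
    SELECT * FROM opportunities WHERE org_id = $1 AND account_id = $2;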


Are you suggesting creating entire database schemas specific to a customer and putting the customer/orgID in all the table names themselves?


No - you put a customer ID column in every table.


Yep, exactly this. Having the customer ID on every table lets you know how to route appropriate queries, and to construct them as well.


I’ve stumbled across this in databases I’ve inherited before and while I appreciated that I could easily do filtering and stuff without joining across tables (and based on the comments here, more easily shard when things scale), it seems to denormalize the data.

Is there a specific term for the trade-off of denormalizing vs. being able to more easily query/shard/etc.?


> Is there a specific term for the trade-off of denormalizing vs. being able to more easily query/shard/etc.?

No, it's pretty much just called that. A good DBA will be able to strike the right balance between normalization and performance using their intuition and experience, which is one of the reasons they're often paid very well despite working in relatively ancient ecosystems.


Generally, the downside of denormalizing is that you risk inconsistency, and some extra storage. The inherent assumption in this case is that the customer ID (or whatever the key is) doesn't change for any associated table, and in that world, inconsistency is not a big risk. So yea, you store some extra data but if you've designed your ids well, it's not that big of a cost.


If you have the customer-ID in the data it basically represents the "owner" of the data. Typically the data of one owner does not refer to the data of other owners. Also the ownership of given data probably never changes so updating it is not a problem.

I wonder if this could or should be a built-in feature in databases. It is "meta-data" meaning data about data, who owns it.


Notion seems like an interesting data storage problem. The vast majority of the data is in the `block` entity, which are organized into a sort of nested set, written/updated individually (user edits one at a time) but read in ranged chunks (a doc).

Offhand this seems like almost a worst case for PG. Since the updates to blocks could contain large data (causing them to be moved often) and there is one big table, it seems likely that the blocks for a single Notion document will end up being non-contiguous on disk and thus require a lot of IO/memory thrashing to read them back out. PG doesn't have a way to tell it how to organize data on disk, so there is no good way around this (CLUSTER doesn't count; it's unusable in most use cases).

Armchair engineering of course - but my first thought would be to find another storage system for blocks that better fits the use case and leave the rest in PG. This does introduce other problems, but it just feels like storing data like this in PG is a bad fit. Maybe storing an entire doc's worth of block entities in a jsonb column would avoid a lot of this?
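Purely illustrative (made-up table), but something like:

    -- one row per document, with all of its blocks stored together
    CREATE TABLE document_blocks (
        document_id UUID PRIMARY KEY,
        blocks      JSONB NOT NULL   -- ordered tree/array of block objects
    );

    -- rendering a doc becomes a single-row lookup instead of a scatter
    -- of heap fetches across a multi-TB table:
    SELECT blocks FROM document_blocks WHERE document_id = $1;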


I would try to use a simple k:v system that is much easier to scale. Even S3 would be a good candidate. Maybe I am missing the point.


Could you explain that k:v system a little bit? Junior eng. here.


KV means key-value. Essentially the suggestion is to organize your data as a big key-value structure (hashmap-like). This term usually means data is not normalized/separated as in regular SQL.

for instance, in SQL:

User table, that has columns: id, email

Article table, that has columns: id, text, user_id

KV/noSQL equivalent:

Article document, that has properties: id, text, user_email


> PG doesn't have a way to tell it how to organize data on disk so there is no good way around this (CLUSTER doesn't count, it's unusable in most use cases).

Aren't tablespaces (https://www.postgresql.org/docs/10/manage-ag-tablespaces.htm...) supposed to help with that?

Haven't used them, I'm honestly curious


Tablespaces allow you to store tables in defined directories on the file system.

What I was talking about is controlling the ordering of the rows within a table on disk. If you are going to be reading some group of rows together often, ideally you want those rows to be contiguous on disk, as a sequential read of a range is much faster than bouncing around to dozens of locations to collect the needed rows. This becomes more important for very large tables. Imagine a 5TB `blocks` table where you need to read 50 blocks to render a given Notion doc, but those blocks could be scattered all over the place on disk - it's a lot more work and it thrashes the page cache.

PG doesn't normally make any guarantees about how rows are ordered on disk, and it may move rows around when updates are made. It does have a CLUSTER operation, which re-orders rows based on the order of an index you give it, but this is a one-time operation and locks the table while running. This makes it functionally useless for large tables that are accessed and updated frequently.

Some other databases do give you control over disk ordering; SQL Server, for example, has `CLUSTERED INDEX`, which you can apply to a table and it'll order data on disk based on the index order, even for new insertions/updates. It does cost a bit more on the write side to maintain this, but it can be worth it in some cases.
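For concreteness (made-up table/index names):

    -- Postgres: a one-off physical reorder that takes an exclusive lock
    CLUSTER blocks USING blocks_document_id_idx;

    -- SQL Server: the ordering is maintained on every insert/update
    CREATE CLUSTERED INDEX ix_blocks_document ON blocks (document_id, block_id);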


Got it, thanks.


>> “Why 480 shards? … 480 is divisible by a lot of numbers”.

This is an important point: picking counts divisible by 12 gives you a lot of flexibility. It's not a coincidence that both time (clocks) and degrees (360) are based on multiples of 12.


Numberphile made a video about "highly composite numbers" / "anti-primes", which are numbers that have more factors than all smaller numbers. https://youtu.be/2JM2oImb9Qg

There is an oeis sequence of them that starts "1, 2, 4, 6, 12, 24, 36, 48, 60, 120, 180, 240, 360, 720, 840...", which notably does not include 480. https://oeis.org/A002182


That's super interesting. It's counterintuitive, but 480 has the same number of factors as 360 (24 factors).

https://www.calculatorsoup.com/calculators/math/factors.php
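The arithmetic from the prime factorizations:

    360 = 2^3 * 3^2 * 5  ->  (3+1)(2+1)(1+1) = 24 divisors
    480 = 2^5 * 3   * 5  ->  (5+1)(1+1)(1+1) = 24 divisors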


so it's weakly highly composite?


I think you can still say it's highly composite, but doesn't qualify as an anti-prime when it isn't the lowest number with that many factors.


"anti-prime" is just Brady trying to rebrand an old boring name to sound cooler (as he does). "highly composite" has a perfectly well defined meaning already, trying to shift it to shoehorn in some artificial distinction between it and "anti-prime" is counterproductive. If the aim is to produce a more descriptive name for a similar sequence of numbers that includes 480, I think GP's "weakly highly composite" is much better. I might propose just "weakly composite" or "mediocrely composite".

Edit: Actually, this was (also) already done by Ramanujan and has a better name than any proposed here: http://oeis.org/A067128 - "largely composite numbers" which does in fact contain 480. Perhaps every sharded system should choose a shard count from this list?


Thanks for correcting me, but I don't think drawing a distinction between "anti-prime" (clearly specific) and "highly composite" (vague) is counterproductive. On the other hand, "highly" and "largely" mean roughly the same thing, so treating "highly composite" and "largely composite" as distinct in this way is linguistically counterintuitive and counterproductive. Likewise, "weakly highly" reads like ambiguous nonsense when you don't go out of your way to specially define it.


It's true that 480 isn't on the list, but you'll notice there is a big skip in the list between 360 and 720, so if you want something in between, 480 is a good choice since it's still a multiple of 12.


Yeah, to be clear, I think there are many important factors that go into choosing a shard count, and blindly jumping to something just because it's some platonic mathematical ideal seems naive. I just thought this is a neat related fact and is probably a good starting point if you're in the process of choosing a shard count.


This is the kind of engineering that's almost impossible to screen for in interviews (particularly for HR-minded pals) and is a big part of what makes a 10x-100x engineer (which definitely DO exist).

Obviously not this thing just by itself, but this sort of knowledge, applied to everyday decisions, compounds over time.


But 480 is 2*240, so it's definitely a close contender.


There is a factorial sequence in it, which makes sense: 24, 120, 720...


Not to mention 1, 2, and 6 :)

7! = 5040 shows up as well in the OEIS sequence. Buuut, 8! = 40320 does not...too many useless 2's in the prime factorization?


My rule of thumb is one factor of three and then as many factors of two as you need. Adding five is not much different from two twos or a two and a three. Ditto for seven.

The only downside imo is not having ten(s) be a balanced deploy number.


Howdy all, author here! Sharding our Postgres monolith was a huge undertaking for a small team. We got a big boost from folks who joined in the weeks before the switch-over date. If you're interested in thinking about problems like this, I'd love to chat. Plus we're also hiring across all roles - check my profile for details.

I’m happy to answer questions about the project here, feel free to reply below.


Nice write-up! Two questions:

- Can you share details on the routing? I.e. how does the app know which database + schema it needs to go to for given workspace?

- Did you consider using several databases on the same Postgres host (instead of schemas within a single database)? Not sure what's better really, curious whether you have any thoughts about it.

Thanks!


> Can you share details on the routing?

All in the application layer! All of our server code runs from the same repo, and every Postgres query gets routed through the same module. This means that it was relatively easy to add a required "shard key" argument to all of our existing queries, and then within our Postgres module consult an in-app mapping between shard key range and DB+schema.

Plumbing that shard key argument through the application was more difficult, but luckily possible due to the hierarchical nature[0] of our data model.
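To make the schema-per-logical-shard idea concrete (just an illustration, not our literal code or schema names):

    -- each logical shard is a Postgres schema on some physical database;
    -- the app hashes the workspace (shard key) to pick one, say logical
    -- shard 17, and then issues schema-qualified queries against the
    -- physical host that owns it:
    SELECT id, type, properties
    FROM shard_017.block
    WHERE workspace_id = $1;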

> Did you consider using several databases on the same Postgres host

If I recall correctly, you cannot use a single client connection to connect to multiple databases on the same host, and so this could have ballooned our connection counts across the application. This is not something we explored too deeply though, would love to hear about potential benefits of splitting tables in this way.

[0] https://www.notion.so/blog/data-model-behind-notion


Ah yes, good point about connections being DB-specific. The schema approach seems more light-weight in that regard. Thanks!


Followup question: does the sharding happen within the app that talks to the DB, or do you shard traffic before it hits them? In the former case the total number of DB connections required presumably grows something like n^2.


What were the limitations that required you to move all customers to a shared system at once?

Could you have selected some workspaces with lower traffic to migrate first? That would have decreased the load on the primary, potentially speeding up the replication, which is a flywheel to enable more customers to migrate to shards.


Good question, that was an option. The main motivating factor here was that vacuums were beginning to take dangerously long - O(weeks) to complete - independent of the load on the database. While migrating spaces in segments would have reduced the number of records future vacuums need to scan, we were already running against the clock to complete one vacuum prior to TXID wraparound[0]. To kick off replication for specific spaces we would have needed to write our shard key to all data owned by those spaces. That would further contribute to TXID growth, and was not something we were comfortable doing.

At the end of the day, this is something we could have explored in more depth, but we were ultimately comfortable with the risk tradeoff of migrating all users at once vs. the consequences of depending on the monolith for longer, largely thanks to the effort we put into validating our migration strategy.

[0] https://blog.sentry.io/2015/07/23/transaction-id-wraparound-...
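(For anyone who wants to watch for this themselves, the standard check is the age of the oldest unfrozen transaction ID:)

    -- distance from the ~2 billion transaction ID wraparound limit, per database
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;

    -- and per table, to see what (auto)vacuum is falling behind on
    SELECT relname, age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY xid_age DESC
    LIMIT 20;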


Seems like something that might still be worth exploring, as if I’m thinking about this correctly, it would allow you to create new shards on the fly, and to migrate workspaces between shards while only locking one workspace at a time, and only for the amount of time required to catch up that single workspace.


Are you at all concerned about the selection of the workspace ID as the partition key? With every workspace on a single partition, couldn't a high-throughput workspace create a "hot" partition that will negatively impact other workspaces on that partition?

At the cost of potentially introducing more cross-partition queries, you might benefit from splitting up high-throughput workspaces. See strategy in https://d0.awsstatic.com/whitepapers/Multi_Tenant_SaaS_Stora..., pages 17-20.


Perhaps this is why they have 15 logical shards per physical machine. Seems like if one logical shard gets too hot, it could be moved to a new physical machine.


> [2] In addition to packaged solutions, we considered a number of alternatives: switching to another database system such as DynamoDB (deemed too risky for our use case)..

1) Can you please talk more about the risks of using DynamoDB or a similar NoSQL solution?

2) Did you consider Spanner, which is the SQL DB of choice within Google and is available on Google Cloud.

Thanks for the wonderful engineering blog posts!


Oh yes—this footnote is not knocking non-relational DBs, in fact we have deployed DynamoDB in production for other use cases today.

We were on a tight timeline due to impending TXID wraparound. Switching database technologies would have required us to rewrite our queries, reexamine all of our indexes, and then validate that queries were both correct and performant on the new database. Even if we had the time to derisk those concerns, we'd be moving our most critical data from a system with scaling thresholds and failure modes we understand to a system we are less familiar with. Generally, we were already familiar with the performance characteristics of Postgres, and could leverage a decent amount of monitoring and tooling we built atop it.

There is nothing inherent about non-relational DBs that make them unsuitable for our workload. If we were designing our data storage architecture from scratch, we'd consider databases that are optimized for our heaviest queries (vs. the flexibility afforded by Postgres today). A large number of those are simple key-value lookups, and a plain key/value store like DynamoDB is great for that. We're considering these alternatives going forward, especially as we optimize specific user workloads.

Re: Cloud Spanner, we didn't consider a cross-cloud migration due to the same time constraints. It still sounds like a wonderful product; we were just not ready at the time.


Makes sense. Thanks for the detailed response!


Curious if you had any tables that were "universal" and weren't shardable, global stuff that isn't workspace-specific. Would these be replicated across all shards? Or just stored in some special db (in which case joining to them might be difficult).


Plenty! But not significant compared to page content in terms of storage or load. That unsharded data is stored in separate unsharded databases today, but each table could be sharded later using a different shard key (often user id).

Luckily we did not have many join queries prior to sharding. The few cross-database joins were trivial to implement as separate queries in application logic.

Our main concern here was referential consistency: if you need to write to multiple databases at once, what happens if one of your writes fails? This was not a problem in practice for us since (1) our unsharded data is relatively static and (2) there are very few bidirectional pointers between data across databases.

Long term there are more interesting problems to solve when distributing unsharded data globally. However, given that our unsharded data is less dynamic and consistency is less critical, we have many levers we can pull here.


The Notion blog has had some quality posts, but it seems to be missing an RSS feed.

You can subscribe via email (no thanks), and I realize there are some good ways to turn an email subscription into a feed (Feedbin.com handles this nicely). Still, I thought some public shaming here might encourage them to build this basic blog feature (that would help people follow their company!).


Anyone starting a tech startup will face this decision: Should I start with a database that has built-in support for sharding? Or should I just start with MySQL or PostgreSQL and defer the sharding question for later?

Notion chose to do manual sharding (aka application-level sharding). That's what you end up doing if you didn't choose a database that has built-in sharding from the get-go, because it is extremely hard to switch to a different database technology. (Larry Ellison compares it to a Catholic marriage -- there is no divorce!)

I skimmed through the article to find the critical piece of info I was looking for: the rationale for doing manual sharding. The rationale supplied in this article is "we wanted control over the distribution of our data." That's a weak explanation. It's the kind of thing you say to justify a bad choice made earlier on: the failure to choose a database that supports automatic sharding from the get-go.


From the article:

> By mid-2020, it was clear that product usage would surpass the abilities of our trusty Postgres monolith, which had served us dutifully through five years and four orders of magnitude of growth.

This should answer your question. Just use a standard database and get on with coding features instead.

If you are successful enough to get to the point that sharding PostgreSQL becomes your bottleneck, you've won.

In addition, you really can't solve scaling problems up front. Where your bottleneck actually occurs will differ from where you think it will.


> Just use a standard database and get on with coding features instead.

Better still, choose a "standard database" that supports sharding.


Please do. And please be my competitor.

We will be happily building out features while your engineering team wastes time solving for scaling problems that you will never have because you don't have features your customers want.


> solving for scaling problems

Choosing database A instead of database B doesn't mean you're suddenly "solving for scaling problems". It just means you're better prepared to one day solve scaling problems, should your product take off.


Every choice has trade-offs. And database sharding should be very low in your priorities when starting a project, because YAGNI. In particular, I will actively laugh at anyone not choosing Postgres « because it does not scale enough » when starting a project. Postgres's pros massively outweigh this imaginary potential con.


Serious question: Let's say I'm using Django or Rails, and I just want to write the code with the ORM. What database am I using that you think makes sense?

I would see an argument that Notion's structure should have made this problem more important for them to resolve earlier in the process (like some document-based DB), but Postgres works well!


Probably CockroachDB - I'm not sure what other free/open source options there are. I'm sure some proprietary databases will do it too but I haven't used Oracle/MSSQL/whatever to know what they give you.


Are you an IT developer who is developing applications for internal use? If so, PostgreSQL is an excellent choice!

But if you're developing an app that scales to internet users, consider Couchbase or some other database that matches your requirements.


Why wouldn't Postgres scale to internet users?


What open source databases support automatic sharding? I think their only other option would have been something like CockroachDB, which probably wasn't very mature when Notion started.


Check out Implementations section on this page:

https://en.wikipedia.org/wiki/Shard_(database_architecture)


PostgreSQL supports native sharding out of the box. It's "manual" in that you have to set up table partitions and foreign data access yourself, but that's arguably still preferable to hacking together the whole thing as a pure application-layer concern.
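A minimal sketch of what that looks like with declarative partitioning plus postgres_fdw (PG 11+; hostnames, table and column names are made up, and the matching tables need to exist on the remote servers):

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    -- one foreign server (and user mapping) per remote shard
    CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'shard1.internal', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
        OPTIONS (user 'app', password 'secret');

    -- parent table hash-partitioned on the tenant key
    CREATE TABLE block (
        workspace_id UUID  NOT NULL,
        id           UUID  NOT NULL,
        data         JSONB
    ) PARTITION BY HASH (workspace_id);

    -- a partition stored locally
    CREATE TABLE block_p0 PARTITION OF block
        FOR VALUES WITH (MODULUS 4, REMAINDER 0);

    -- a partition that lives on another instance
    CREATE FOREIGN TABLE block_p1 PARTITION OF block
        FOR VALUES WITH (MODULUS 4, REMAINDER 1)
        SERVER shard1 OPTIONS (table_name 'block_p1');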


It was my understanding that PostgreSQL supports table partitioning out of the box on a single instance, but not the ability to distribute the processing across multiple instances.


Postgres supports FDW. You can also use Aurora Postgres in AWS, which can take you fairly far. Another option for many SaaS products: simply run multiple clusters and assign a given tenant to a particular cluster.


On a related note, how far can a single Postgres instance get you?

If you had some behemoth with 32TB of RAM and 1PB of storage, could all of Notion fit?

I'm obviously ignoring the single point of failure here, but sometimes the simplicity could be worth it if you're willing to handle that.

I’d be curious to hear about a site that’s in the Alexa 1000 architected how I describe.


> If you had some behemoth with 32TB of RAM and 1PB of storage, could all of Notion fit?

Postgres has a bunch of low-level locks and buffers (protected by locks) which are essentially single-threaded. So even if you had a 500-CPU instance, at some point you'd not be able to get more throughput out of it.

Of course it also depends on what you are doing. Large tables with a high update rate are harder to handle than large tables which are insert-only. My personal opinion is that Postgres tables with >100GB of data (without indexes) are starting to be a pain to work with, no matter how much RAM, CPU or IOPS you have.


Is there a relational database that could efficiently leverage such a single powerful machine?


I imagine you want NUMA-aware software specifically designed for multi-socket systems; off the top of my head the ones I've heard of are IBM DB2, SAP HANA, and Microsoft SQL Server — although I haven't had a chance to actually use any of them.


There was a study done recently showing that no major DBMS product can scale past 64 cores, for reasons similar to why Postgres can't -- too much contention on shared resources like locks.


Not currently for relational databases specifically as far as I know, though it is possible to build one in principle. The kinds of use cases and applications that really benefit from that kind of deployment hardware tend not to have classically relational data models.


Right, and at that point you should be using some sort of NoSQL anyway.


SQL Server scales incredibly well with the hardware, but so does its price.


Postgres starts to struggle when you push it past 10TB; the details will depend on your specific data model and workload. I've seen it pushed to 50TB but I would not recommend it. The architecture simply isn't designed for that kind of scale.

There are fast database engines specifically designed for single servers with 1PB of storage, and these servers commonly have hundreds of cores. This is much more efficient for some use cases. It is the kind of thing you find on edge platforms designed to manage and operate on sensor data models. You build database engine internals very differently when working at this storage density and number of cores; many good architectural ideas in databases at much smaller scales become severe bottlenecks even on a single server.


Thanks for the reply - I'm not super familiar with databases. Do you have a few in mind (preferably open-source for reference)?

Also, are you referring to a single-instance of Postgres, or that Postgres would struggle with 10TB even with something like Citus or TimescaleDB?


I was referring to a single instance of classic PostgreSQL running on a single server. Scale-out databases are a completely different animal, it is difficult to compare them to scale-up databases because they make a different set of tradeoffs.

Extreme scale-up database engines are relatively rare and usually bespoke, I am not aware of anything close in open source. They only have one major advantage -- they are easily deployable outside the data center, e.g. at the edge, because it is a single self-contained box and relatively power efficient. Most people opt for scale-out because 1) there are many to choose from in open source and 2) it will work just as well as extreme scale-up for most applications, though you'll need more hardware for the same workload.

I've started to appreciate the utility of the extreme scale-up engines in real-world applications, though I've mostly worked on scale-out engines. They solve a real problem at the edge for data-intensive applications, but currently most people deploy in the data center, so there isn't a pressing need.


Many companies I worked with stopped using Notion in the past because of performance problems - the application was simply too slow. Has this changed with the new sharded architecture? I haven't used it since probably last year.


From personal experience (not with Notion) once you fall behind it’s hard to catch up, and nobody declares an emergency until it’s already happening.

Doing big surgery takes time, meanwhile your workload continues to grow and grow. It’s not uncommon for a major win to only reset the clock by 6-12 months, and then you either have to run to the next one, or you have two teams working on separate angles at the same time. All the while people are learning as they go because the business settled on catch-up instead of capacity planning.

Being a little successful can be tough.


The app is a lot more snappy, not super fast, but usable.

Search is still pretty slow.


I've noticed that performance is no longer a problem for me. The only gripe I have with Notion is too much flexibility; it's easy to completely mess something up even by accident.


Agreed, our Notion ended up being a soup of disparate outdated notes all over the place with no rhyme or reason across a 480+ employee business.

I'm founding engineer of a new startup now and I advocated for Guru. https://www.getguru.com/

I like how information is marked stale/verified and the permissions are much tighter. The data ladder is also better structured. I'll see how it works as our headcount increases.


You mean editing documents you're just taking a look at?

Notion has a lock feature [1] on a document-level.

[1] - https://www.notion.so/Lock-page-content-d2b995727c0b483f9f35...


Curious, what app have you seen people migrate to?


What size data do y'all start worrying about vertical scaling?

I'm surprised at the 500GB table / 10TB DB sizes listed.

I start to worry when my DB hits 1TB and any single "hot" table is over 100GB.

I know there isn't a one-size-fits-all answer, but I'm genuinely curious if I'm just behind, living by lessons learned ~5 years ago, and we are in a better place now.


Overall this sounds very similar to Etsy's sharding setup, which I think descends from Flickr's sharding setup. Interesting that this is still the way to do it.


Curious about thoughts on using something totally different - would a column-oriented DB like ClickHouse have suited a block-data-oriented app like Notion better?

Does anyone with experience have any thoughts in favour of or against such an implementation?


Totally different use case with ClickHouse - point queries there are still slower than Citus, but it kicks its ass on multi-row aggregations. For an auto-sharded OLTP DB, CockroachDB or TiDB seem like way better options.


+1. The first thought I had when I saw this was: why is Notion using a Postgres DB in the first place? I can see it being used for customer data (which probably doesn't require sharding), but for the content itself a NoSQL or column-oriented DB is a better fit. The question I have is: what is Notion storing in these sharded Postgres DBs?


Slightly unrelated, but Notion's unofficial Python API is cool AF! https://github.com/jamalex/notion-py

It's way more powerful than their official API. The official API can only interact with databases, which (as anyone who uses Notion knows) is a tiny subset of the overall things you want to use Notion for.

The unofficial Python API lets you have complete programmatic access to all notion pages.

The tradeoff is that certain things break. However, after reading the code, I found it quite easy to fix the problems I ran into (https://github.com/jamalex/notion-py/pull/345) and I suspect you'll be able to do the same. (I empathize with maintainers not prioritizing their open source work ahead of family life, business, etc.)

I was surprised that it was even possible to have a nice Python class for every possible Notion object, let alone control them and update them on the fly. I wish their official API were as flexible. Maybe one day.


I'm curious whether Postgres native partitioning was considered. If so, I'd like to hear the rationale for not doing it.

It seems like you made the database sharding decision as a result of vacuum taking too long. Partitioning the problematic table (assuming you're storing "blocks" in one giant table) would have enabled per-partition vacuuming, which would avoid long-running vacuum processes.
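E.g. something along these lines (hypothetical names, hash-partitioning the big table on the tenant key):

    CREATE TABLE block (
        workspace_id UUID NOT NULL,
        id           UUID NOT NULL,
        data         JSONB,
        PRIMARY KEY (workspace_id, id)
    ) PARTITION BY HASH (workspace_id);

    CREATE TABLE block_p00 PARTITION OF block
        FOR VALUES WITH (MODULUS 32, REMAINDER 0);
    -- ...and so on for the remaining 31 partitions

    -- each partition is vacuumed (and frozen) independently, so there is
    -- no single multi-week vacuum over the whole table:
    VACUUM (FREEZE, VERBOSE) block_p00;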


I still really wonder what a single bare-metal server would have done in terms of performance. [0]

It may have saved months of planning that might never have been required.

[0]: https://yourdatafitsinram.net/


You'll hit all sorts of other fun problems by vertically scaling.

- Many concurrent connections to a single database which Postgres has traditionally not handled well (though improved in recent versions).

- You're now on the hook for writing a database control plane.

- Backup and restore is much flakier since the data volume is so much larger. Lots of weird shit starts happening when you download tens of TB.

- In general, everything is harder once you start nearing machine limits.


Yes I thought Postgres didn't recommend more than 700 connections per server (or maybe it was 300) and that was only three or four years ago.


This can quickly become untenable depending on the availability and contractual requirements you have. You're literally building on a single point of failure. And if it's really big and you design wrong, recovery could take weeks.


Put a slave next to it - that's not the point. AFAIK this is what Stack Overflow still does and it works for them. Why not for Notion?


Notion may have a much higher proportion of write volume than Stack Overflow, given that most of Stack Overflow's traffic must be from people who come in from Google, read the answers, and leave. Notion is a productivity app, so I would guess most people are actively writing.


If StackOverflow jumped off a bridge, would you jump off one too?

You can't base a huge decision like that on "well it worked for somebody else". How did it work for them? What was their platform? What was their data profile? What were their requirements? What was their acceptable level of risk? It could be that they have so many caching layers and lose so little money that they are fine with the database being down for 5 days while they recover. Or that it doesn't actually work that fine for them, because who wants to broadcast that their system is shitty? Or that they've just been lucky.

It's not worth finding out the hard way. Build the best system you can with the time and money and expertise you have. Don't cheap out just because you think you can get away with it - especially on the critical stuff.


I have 2 questions:

* How do you set up hot standby for each database?

* Do you have coordinated backups for all the databases, or are they backed up individually?


This was so insightful - it gave me a clearer picture of everything that needs to be done during a migration!


What is this, 2008? "Our team spent months architecting this migration". Maybe you should have gone with a cloud offering like Dynamo/RDS, or a serverless database like BigQuery/Aurora.



