My startup currently does just this 'at scale', which is for us ~150 B2B customers with a total database footprint of ~500 GB. We are using Rails and the Apartment gem to do multi-tenancy via unique databases per account, with a single master database holding some top-level tables.
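The shape of that setup, as a minimal sketch (the model names and the subdomain elevator are illustrative, not our exact code):

    # config/initializers/apartment.rb
    require 'apartment/elevators/subdomain'

    Apartment.configure do |config|
      # Top-level tables that stay in the shared/master schema
      config.excluded_models = %w[Account User]
      # One tenant (Postgres schema) per account
      config.tenant_names = -> { Account.pluck(:subdomain) }
    end

    # Pick the tenant per request from the subdomain
    Rails.application.config.middleware.use Apartment::Elevators::Subdomain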
This architecture decision is one of my biggest regrets, and we are currently in the process of rebuilding into a single-database model.
FWIW, this process has worked well for what it was originally intended to do. Data security has a nice DB-level stopgap, and we can keep customer data nicely isolated. It's nice for extracting all data for a single customer if we have extended debugging work or unique data modeling work. It saves a lot of application-layer logic and code. I'm sure for the most part it makes the system slightly faster.
However as we have grown this has become a huge headache. It is blocking major feature refactors and improvements. It restricts our data flexibility a lot. Operationally there are some killers. Data migrations take a long time, and if they fail you are left with multiple databases in different states and no clear sense of where the break occurred.
Lastly, if you use the Apartment gem, you are at the mercy of a poorly supported library that has deep ties into ActiveRecord. The company behind it abandoned this approach as described here: https://influitive.io/our-multi-tenancy-journey-with-postgre...
Happy to expand on this if anybody is interested. It's currently a cause of major frustration in my life.
Echoing this as well: I worked for Influitive and was one of the original authors of apartment (sorry!).
There are a lot of headaches involved with the "tenant per schema" approach. Certainly it was nice to never have to worry about the "customer is seeing data from another customer" bug (a death knell if you're in enterprisish B2B software), but it added so many problems:
- Migrations become a very expensive and time-consuming process, and potentially fraught with errors. Doing continuous-deployment style development that involves database schema changes is close to impossible without putting a LOT of effort into having super-safe migrations.
- You'll run into weird edge cases due to the fact that you have an absolutely massive schema (since every table you have is multiplied by your number of tenants). We had to patch Rails to get around some column caching it was doing.
- Cloud DB hosting often doesn't play nice with this solution. We continually saw weird performance issues on Heroku Postgres, particularly with backup / restores (Heroku now has warnings against this approach in their docs)
- It doesn't get you any closer to horizontal scalability, since connecting to a different server is significantly different than connecting to another schema.
- It will probably push the need for a dedicated BI / DW environment earlier than you would otherwise need it, due to the inability to analyze data cross-schema.
I still think there's maybe an interesting approach using partitioning rather than schemas that eliminates a lot of these problems, but apartment probably isn't the library to do it (for starters, migrations would be entirely different if partitioning is used over schemas).
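For a rough idea of what the partitioning route looks like (a sketch, assuming Postgres declarative partitioning; the table name and tenant values are made up):

    # Hypothetical migration: one logical table and one schema to migrate,
    # but physically separate storage per tenant via LIST partitions.
    class PartitionEventsByTenant < ActiveRecord::Migration[6.0]
      def up
        execute <<~SQL
          CREATE TABLE events (
            tenant_id bigint NOT NULL,
            id        bigserial,
            payload   jsonb,
            PRIMARY KEY (tenant_id, id)
          ) PARTITION BY LIST (tenant_id);

          -- one of these per tenant, created when the tenant signs up
          CREATE TABLE events_tenant_42 PARTITION OF events
            FOR VALUES IN (42);
        SQL
      end

      def down
        execute 'DROP TABLE events'
      end
    end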
This leads me to believe that everything you mentioned is already subtly broken; the DB-per-account model just exposes it.
Is there something between the two solutions, or pieces that could be modified, that would collapse the problem?
What about going with DB account per app account and using views to limit exposure to data. If user level views are applied before the business logic has access, then the death knell scenario should be avoided.
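Concretely, that could look something like this (a sketch, assuming Postgres and one DB login per app account; all names are invented):

    # Hypothetical migration: app roles only ever see a view that
    # filters rows down to the connected DB account.
    class AddTenantScopedView < ActiveRecord::Migration[6.0]
      def up
        execute <<~SQL
          CREATE VIEW invoices_scoped AS
            SELECT * FROM invoices
            WHERE tenant_id = (SELECT id FROM accounts
                               WHERE db_role = current_user);

          REVOKE ALL ON invoices FROM PUBLIC;
          GRANT SELECT, INSERT, UPDATE, DELETE ON invoices_scoped
            TO tenant_role_42;
        SQL
      end
    end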
Migrations seem hard regardless, they have to be automated anyway and have a verified rollback/undo mechanism. Code should be multiversion aware if you are doing live upgrades.
I am personally a fan of read-only mode while doing the last phase of an ETL upgrade on the backend. Snapshot databases, run a bunch of Hive jobs and throw it back down on a new set of DBs.
In the cases you outlined where it was dangerous or fragile, I think those systems were already that way, and that fragility is a symptom of a different problem. Fix that and lots of other stuff disappears.
I agree that migrations are painful at the best of times, but dealing with the complexity of migrating a single database is far simpler than dealing with migrating hundreds of schemas:
- Migrations will first of all just take longer - you're multiplying the number of schema changes by the number of tenants you have.
- While in an ideal world migrations should run purely within a transaction, occasionally performance considerations mandate that you run without DDL transactions. When some tenants fail, leaving your migrations in a partially completed state for some tenants and not others, it can be scary and painful.
- In my experience, almost no one approaches data migrations in a way that is purely backwards compatible 100% of the time without exception. You certainly can, but there's a significant tax associated with it. If you're in a traditional scoped environment, you can often get away with the potential for errors in the minuscule time that a schema change is running (of course, some schema changes aren't run in minuscule time, but those are the ones you're more likely to plan for).
Going read only during migrations is an interesting approach, but there's real business costs associated with that (particularly if your migration speed is multiplied by running it across tenants).
I don't want to say that you should never isolate data on a schema level, but I do think it's something that shouldn't be a standard tool to reach for. For the vast majority of companies, the costs outweigh the benefits in my mind.
A model I haven't fully fleshed out but which looked promising was a single source of truth, and then copying out to Docker instances of each tenant's database. Obviously the nature of data access may not make this practical, but it would for our use case (the end user is mostly read-only, and most writes occur in the background, which we control).
Can confirm, here be dragons. I did a DB per tenant for a local franchise retailer and it was the worst design mistake I ever made, which of course seemed justified at the time (different tax rules, what not), but we never managed to get off it and I spent a significant amount of time working around it, building ETL sync processes to suck everything into one big DB, and so on.
Instead of a DB per tenant, or a table per tenant, just add a TenantId column on every table from day 1.
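In Rails terms that's roughly (a sketch; the Current attributes class and model names are illustrative):

    # Per-request tenant context
    class Current < ActiveSupport::CurrentAttributes
      attribute :tenant_id
    end

    # Every tenant-owned table gets the column (and an index) from day 1
    class AddTenantToWidgets < ActiveRecord::Migration[6.0]
      def change
        add_reference :widgets, :tenant, null: false, index: true
      end
    end

    # And every tenant-owned model scopes through it
    class Widget < ApplicationRecord
      belongs_to :tenant
      default_scope { where(tenant_id: Current.tenant_id) }
    end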
How does the architecture block major refactors or improvements? Are you running a single codebase for all your tenants, albeit with separate schemas for each?
Edit: on reading the link you included, it seems like a lot of the problems are on the Rails implementation of the architecture with ActiveRecord and Apartment rather than with the architecture itself.
Here I'll give you one: if you want to change a property of the database that will give your specific use improved performance, you have no way to transactionally apply that change. Rolling back becomes a problem of operational scale, and so does rolling out.
What if you need to release some feature, but that feature requires a database feature enabled? Normally you enable it once, in a transaction hopefully, and then roll out your application. With this style you have to wait for N database servers to connect, enable, validate, then go live before you can even attempt to deploy the application, much less recover if you get it wrong.
For this architecture to work you'd likely have to be able to operate more than one release at a time and move customers along with their database. If you're able to do that then this becomes much less of an issue.
I’m not a database architect, but can’t you just add the column or whatever as a nullable field in one migration, update the application to write to the column, and then make the field non-nullable in a further migration?
As I say I’m no expert but this is what I’ve done previously.
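A sketch of that three-step (expand/contract) pattern, with illustrative names:

    # Step 1: add the column as nullable (cheap, backwards compatible)
    class AddStatusToOrders < ActiveRecord::Migration[6.0]
      def change
        add_column :orders, :status, :string
      end
    end

    # Step 2: deploy code that writes :status on every create/update,
    # then backfill existing rows in batches outside any migration:
    #   Order.where(status: nil).in_batches do |batch|
    #     batch.update_all(status: 'legacy')
    #   end

    # Step 3: only once the backfill is done, enforce the constraint
    class MakeOrderStatusNotNull < ActiveRecord::Migration[6.0]
      def change
        change_column_null :orders, :status, false
      end
    end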
Sure, this problem can be solved as can any other -- but there's a cost to it in development time. For every feature, the dev working on it has to do it in 3 steps to ensure backwards compatibility and handle partial migration failures gracefully. Imagine doing this for a small feature - the ops effort dwarfs the actual feature! Many small features would probably be dropped.
Multi-single-tenant actually makes it easier to do transactional schema upgrades because these operations only impact one customer at a time. No fiddling with middle-of-the-migration state.
It becomes a non-issue if you're able to go down for a few moments.
Not necessarily viable for some software though, it just depends on what you're running.
(Especially because you can apply the migration on slave and subsequently switch master, making this downtime as quick as the time your system needs to do a database switch.)
This doesn't apply to big services with millions of users, however.
> You roll the feature out incrementally to the users who are ready. You build backwards compatible features. Basic software engineering.
The parent mentioned having to support ~150 B2B customers, so the effort is amplified a hundredfold or more: over 100 individual customer databases have to be coddled as you've described, and they're stuck with poor tooling to manage changes across their unusual architecture.
While not a web-app, we too have ~200 B2B customers running our application, and each one has its own DB. Some self-host; most others are hosted by us.
We have written our own tool to upgrade the DB schema. It's nothing fancy: it takes an XML description, compares it with the current DB, and makes the changes. But it does ensure the upgrade process is automated.
New versions of our application come with a new DB schema, and we also have an auto-updater for our software. We have a small launcher application which allows the users to select a few older versions of our software, so upgrades are almost always safe in the sense that if the user encounters a bug in the new version, they can try the previous one.
If the DB schema upgrade tool fails for some reason, the application upgrade will be held back and we'll get a notification.
Combined this allows us to be quite aggressive with pushing out new versions, and to do so without manual intervention.
A limitation with this setup is that DB changes have to be backwards compatible. So we do have a few cases of "if this table exists do this, else do that" type logic and similar, but mostly it's not a big deal.
For example, I recently had to refactor an old table where they had used five columns for additional references (ref_1, ref_2, etc.) into a new table of additional references. To handle this I just made a new view which would return either the refs from the new table, or the data from the columns in case the new table didn't have any data for that record.
I then changed the application to use a grid instead of five separate edit controls, and to use this new view. If the user had to use an older version, that version would just write to the five fields and not populate a row in the new table, and the view would then return this data if the user later viewed it in the new version of our application.
So the "overhead" of backwards compatibility in this case was just ensuring this would be sufficient and writing that view, so just a few minutes.
Doing the same thing to 100 databases is usually easy to automate.
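For the Rails/apartment setup discussed upthread, the loop itself is short (a sketch of roughly what `rake apartment:migrate` does):

    # Run pending migrations against every tenant schema in turn
    Apartment.tenant_names.each do |tenant|
      Apartment::Tenant.switch(tenant) do
        ActiveRecord::Tasks::DatabaseTasks.migrate
      end
    end
    # The hard part isn't this loop; it's what to do when tenant 61
    # of 100 fails while 1-60 are already on the new schema.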
I managed a group that had a dozen customers on a single multi-tenant database and some migrations took more than a day - we needed to set maintenance windows on weekends (and once we had to roll back after an error).
Having a dozen databases we could roll out updates to on a client per client basis would have saved us from some enormous headaches.
For me it falls into the category of decisions that are easy to make and difficult to un-make. If for whatever reason you decide this was the wrong choice for you, be it tech needs (e.g. Rails) or business needs, merging your data and going back into a codebase to add this level of filtering is a massive undertaking.
Indeed, but if you are making a B2B enterprise/SMB SaaS, I think you are most likely to regret the opposite choice [1][2]. A lot of companies run a single instance, multitenant application and have to develop custom sharding functionality down the road when they realize the inevitable: that most joins and caches are only needed on strict subsets of the data that are tenant specific.
If you get successful enough in this type of application space, you reach a mature state in which you need to be able to:
* Run large dedicated instances for your largest customers, because either their performance or security requirements mandate it.
* Share resources among a large number of smaller customers, for efficiency and fault tolerance reasons.
You can get there in two ways:
* You start with a massive multi-tenant application, and you figure out a way to shard it and separate it into pieces later.
* You start with multiple small applications, and you develop the ability to orchestrate the group, and scale the largest of them.
I would argue the latter is more flexible and cost efficient, and requires less technical prowess.
Converting a multi-tenant system into a single-tenant one is a lot easier than the reverse. First and foremost, you can simply run the full multi-tenant system with only a single tenant, which if nothing else enables progressive development (you can slowly remove those now-unnecessary WHERE clauses, etc).
See my sibling comment with the Hubspot example. Even though the system might work internally, other things will break if you start having duplicate account IDs because other systems don't think of the account ID as a cluster-local identifier, but as a global one.
Just thinking through this, but if it's an entirely separate environment, just host it on a separate subdomain and the account id becomes irrelevant. If you have a functioning staging environment, you already have the ability to run an isolated environment without issue, this is just a different application of the same idea.
You can probably run the environment itself, but other systems (release management, monitoring, billing, etc) probably rely on the account_id being unique.
I think you are talking about problems going from multiple single-tenant systems to a single multi-tenant system. Your parent is talking about the opposite.
In the simple, YAGNI implementation of this, when you create a new HubSpot account, most likely that will insert a new row into the accounts table, and the auto generated ID of that row will be your account ID. Therefore you need uniqueness to be enforced at that level.
If you want to start running a separate copy of the system, you need to refactor the system to move that sequence out of the database so that two different customers running on different clusters don't end up with the same account ID. This is just an example, but there are many problems like this that are caused by the assumption that the production system is a single unique system.
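One escape hatch (a sketch, assuming Postgres sequences; the cluster-numbering scheme is invented for illustration) is to carve the ID space up so clusters can't collide:

    # Give each cluster a disjoint ID stream: every ID handed out by
    # cluster N is congruent to N modulo MAX_CLUSTERS.
    class ClusterLocalAccountIds < ActiveRecord::Migration[6.0]
      CLUSTER_NO   = 2    # this cluster's index (deployment-specific)
      MAX_CLUSTERS = 100  # hard ceiling on how many clusters you'll run

      def up
        # Restart above any existing ID, at a value congruent to
        # CLUSTER_NO mod MAX_CLUSTERS.
        start = ((Account.maximum(:id).to_i / MAX_CLUSTERS) + 1) *
                MAX_CLUSTERS + CLUSTER_NO
        execute <<~SQL
          ALTER SEQUENCE accounts_id_seq
            INCREMENT BY #{MAX_CLUSTERS}
            RESTART WITH #{start};
        SQL
      end
    end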
Everything has a solution, but want to bet that at least 20 different internal systems at HubSpot assume that the account ID in that URL is globally unique?
In my experience by the time you reach this point you have a lot of operational complexity because you and your team are used to your production cluster being a single behemoth, so chances are it's not easy to stand up a new one or the overhead for doing so is massive (i.e. your production system grew very complex because there is rarely if ever a need to stand up a new one).
Additionally, a multi tenant behemoth might be full of assumptions that it's the only system in town therefore making it hard to run a separate instance (i.e. uniqueness constraints on names, IDs, etc).
One of the issues I see in one of my projects is high interactivity between accounts. E.g. if account 1 'sends' something to account 2, both of the shared/separate db instances need to be up, or there'll need to be some kind of queueing mechanism.
That's hard enough, and then add to it that most clients want to BYOK (bring their own keys) to those instances.
High interactivity between accounts is a good reason to not adopt the proposed multi-single-tenant architecture. The scenarios discussed are B2B enterprisey apps in which the environments are essentially independent from each other.
Yeah, it's far from a black and white issue. I am by no means against single tenant database structures, and for our initial use case it was likely the correct decision. I have not run into anything that stopped us in our tracks.
My pushback focuses on two things:
1: The Rails implementation of this, specifically with off-the-shelf gems and integration with ActiveRecord. It presents a lot of unknown framework complexity down the road. We are currently running on a fork of the Apartment gem as it breaks under newer versions of Rails.
2: Long term support for evolving business case. Now, this is very much a problem unique to our company and how our business/product has evolved. We started out with a very clear separation of account concerns, and as we've grown and pivoted things have changed. You are unlikely to experience the same evolution we have. HOWEVER, all startups face these unknowns, and giving yourself as much flexibility early in the game will pay off down the road. There are no right answers here, you just have to think it through and weigh the costs/benefits. Maybe for you that flexibility is found in single tenant. For us it appears in hindsight that multi-tenant would have been the best long term choice.
I totally get why you are doing this, but I would think DB management and migrations would be really troublesome. Do you have a migration framework to manage migrations across all catalogs?