Hacker News
Living with single-tenant and multi-tenant architectures (medium.com/schibsted-engineering)
109 points by mkasprowicz 4 days ago | 39 comments

There's a word missing from this article - SaaS.

If you aspire to deliver the same software solution to multiple customers then most likely you are in the SaaS business.

And as Marc Benioff famously observed a long time ago, "Multi-tenancy is a requirement for a SaaS vendor to be successful."

I think about this often because this was a crucial thing we got wrong at my last company.

We got a few large enterprise customers early on, which was great. But each had some unique requirements at the time. With hindsight, they weren't really that unique at all, but there was only a single customer asking for each one.

We took the decision to use separate databases (schemas in the Oracle world) for each customer. That way we could more easily roll out features to individual customers. We were careful to keep only a single codebase, I'd seen that before. But still, any customer could be on their own version of that codebase at any time, with their own schema that matched the code at that version.

I now think of this approach as maybe getting into heroin (no direct experience). Feels great, powerful, you can do anything. But ultimately it will kill you, and the longer you do it, the harder it is to get back onto the path of righteousness - a decent multi-tenant architecture.

I've worked on projects where we did it the opposite way, with multi-tenancy as the default. It didn't work. For a couple of customers, we had to carve out their own dedicated resources. A few of them were absolutely murdering the performance of the rest of the cluster(s), and some had business requirements to be completely isolated. Customers with similar requirements and workloads we kept in a multi-tenant pool.

Even though some of the benefit of multi-tenancy is (supposedly) simpler management (one set of resources), multi-tenancy can actually become more difficult as the customer pool gets bigger, or workloads get more uneven. Maintenance on the whole pool becomes more and more problematic, and you try to patch around or delay it by scaling vertically. You basically run into every possible problem and hit limits you hadn't thought of sooner than with single-tenancy. And worst of all, it's impacting more and more customers.

Multitenancy means you can mix servers together, not that you must mix them.

Traffic shaping can help with customers that need a special SLA, even routing some traffic to servers that no other users can access. You can also corral bad actors in the same way.

The secret is to do a hybrid thing. You’re completely multi-tenant but build it so there can be many copies of shared resources and associate a tenant with one specific copy of each.

So customer A is actually on load balancer #1, elasticsearch #3, db cluster #2, app server asg #4, etc.. Then when you need to carve out separate resources for VIP customers you just add a new number and only assign them to it.
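A minimal sketch of that tenant-to-copy assignment, in case it helps to see it written down. The resource names and copy numbers come from the comment above; the `Placement` class, the default pool and the `carve_out` helper are made up for illustration and are not anyone's actual tooling.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class Placement:
    """Which numbered copy of each shared resource a tenant is assigned to."""
    load_balancer: int
    elasticsearch: int
    db_cluster: int
    app_server_asg: int


# Hypothetical default pool for tenants without an explicit assignment.
DEFAULT = Placement(load_balancer=1, elasticsearch=1, db_cluster=1, app_server_asg=1)

# "Customer A" from the comment above; the tenant key itself is made up.
PLACEMENTS = {
    "customer_a": Placement(load_balancer=1, elasticsearch=3, db_cluster=2, app_server_asg=4),
}


def placement_for(tenant):
    """Look up where a tenant's traffic and data should go."""
    return PLACEMENTS.get(tenant, DEFAULT)


def carve_out(tenant, copy_number):
    """Assign a VIP tenant to a copy number that no other tenant uses.

    Conservative on purpose: a number counts as taken if any resource kind
    of any other tenant (or the default pool) already uses it.
    """
    others = [DEFAULT] + [p for name, p in PLACEMENTS.items() if name != tenant]
    if any(copy_number in astuple(p) for p in others):
        raise ValueError(f"copy #{copy_number} is already shared")
    PLACEMENTS[tenant] = Placement(copy_number, copy_number, copy_number, copy_number)
    return PLACEMENTS[tenant]
```

The point is only that the application never hardcodes "the" database or "the" load balancer; it always asks which copy a tenant lives on, so dedicating resources is a data change rather than a code change.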

Multi-tenant architecture with horizontal scaling!

That can work, but even that can get incrementally complex as you (or your customers' workloads) scale. There are also limitations between those components you mentioned that sometimes necessitate further separation. To deal with all that within the hybrid or multi-tenant architecture, you end up spending a lot more time on tooling just to be able to manage it, and that takes away time from more useful improvements and maintenance.

It's like spoon theory[1]. As a product team, you only have so many spoons. Every bit of difficulty you spend on maintaining a system's operation takes away spoons that could be used for other parts of the product. Regardless of whether you use multi-tenant or hybrid or single-tenant, if it starts to take away all your spoons, you should be ready to try a different model and see if you get some spoons back. (I think this applies to all aspects of a product team, not just operations)

[1] https://en.wikipedia.org/wiki/Spoon_theory

That is because you did not have 2 levels of multi-tenancy. First you have schemas inside one database. Next you can have multiple databases where these schemas live. If you want a single tenant per database, simply make a database with one schema for those resource-hungry tenants. Tooling is key here. Java Spring, for example, can do that. Not that easy though and requires a lot of knowhow for a lot of edge cases.

The main problem with multi-tenancy like that is migrations. They can take a lot of time. There are 2 strategies: keep migrations backward- and forward-compatible, or use blue-green deployments with Istio or equivalent tooling (e.g. if one server is behind, it is still routed to the old front-end).

> Not that easy though and requires a lot of knowhow for a lot of edge cases.

The key to implementing multi-tenant applications in Spring is AbstractRoutingDataSource. It is not that hard, see https://stackoverflow.com/questions/49759672/multi-tenancy-m...

Still sounds like a multi-tenant system, just with some dedicated shards. This is fine as long as the codebase and db schemas are in sync.

I think there's another word missing: "data sovereignty". Depending on your business, your customers might need to keep user PII within their country. Having a single tenant solution makes this possible (just stand up a server in a data center in their country and have it communicate only within their country).

It is really just another set of tradeoffs and I think the author does a good job of detailing them from the perspective of a company with internal customers. With external customers, the calculus can change.

> But ultimately it will kill you, and the longer you do it, the harder it is to get back onto the path of righteousness - a decent multi-tenant architecture.


The nice thing about a multi-tenant architecture is that it enforces consistency and therefore gives you scale. This can be achieved with single tenant as long as you are ruthless about it (that's what we have done at my current job).

We run separate servers for each client (they can also self host). We built a system to let clients control their version, so they can stay at an earlier release. But everyone stays on the mainline codebase and, critically, database schema.

But sticking with a multi-tenant architecture for a publicly facing application will make it easier to enforce that consistency. That may lose you some sales, but will lead to better scalability.

> I think there's another word missing: "data sovereignty". Depending on your business, your customers might need to keep user PII within their country.

That's assuming a single multi-tenant saas footprint.

There is absolutely nothing preventing you from standing up a footprint in each compliance region and customers are assigned to the region that satisfies their requirements for data location.

You get the benefit of fewer footprints to manage, while still meeting the requirements necessary to serve your customers.

And if you have a customer that absolutely must have isolated infrastructure, stand one up for them and pass along the increased cost associated.

It is worth emphasizing an aspect of your point when you mention "compliance regions": this does not necessarily mean geographic regions or different countries.

This setup is how every company I have worked for that has had government clients handles them: a dedicated 'footprint' isolated from your commercial footprint, even if it remains on the same cloud provider. AWS even has GovCloud regions specifically for this scenario.

The architecture that we’ve chosen is a hybrid, with two levels of tenant separation. The core of the application is in a multi-tenant single database. The general rule of thumb is that _no_ PII can land here, only operational data (what data is PII vs operational can be fuzzy).

We also have the ability for specific modules of the monolithic application to have their own database that is scoped to that module (we’re using Elixir, so client-specific implementations are scoped as an umbrella application).

Then we have satellite systems. These are _usually_ edge servers that have client-specific configuration and branding on what they share to the web. They may also have their own database used for various other purposes. Most of these are built so that the data within is isolated from the core system and can be put in appropriate locations for data residency.

I (briefly) worked at a company who fell into this trap. They ended up having an entire department built just to manage the versioning of client instances.

For our monolith, we have a separate database per tenant. Databases are hosted on several servers (around 10k-20k databases per server) -- to spread the load. Additionally, each region (EU, US, etc.) has its own infrastructure. Having separate databases is very useful in that we have complete isolation of business data, it allows some performance improvements (full table scans are less harmful), and it's easier to investigate client problems for our support team, because you only see what you need to see. However, obviously, CPU and disk are shared and one especially active tenant can degrade the service for others (we're working on ways to throttle them).

Our roll-outs first apply migrations, then deploy the code. Migrations are applied by iterating over all databases, and it can sometimes take up to several hours before the code is finally deployed (so that it can use the new schema in every DB). That creates a very large window where old code can see new DB schemas, so we have to be careful to keep our migrations forward- and backward-compatible.
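A rough sketch of that migrate-first, deploy-later loop, using in-memory SQLite databases as stand-ins for the per-tenant databases. Everything here (table names, the `schema_version` table, the tenant names) is an assumption for illustration. The migration is additive only, a new nullable column, which is one common way to keep old code working against the new schema.

```python
import sqlite3


def make_tenant_db():
    """Create one tenant database at schema version 1."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    db.execute("CREATE TABLE schema_version (version INTEGER NOT NULL)")
    db.execute("INSERT INTO schema_version VALUES (1)")
    db.commit()
    return db


# Version number -> statement. Additive only: old code never mentions
# the new column, so it keeps working while the rollout is in progress.
MIGRATIONS = {
    2: "ALTER TABLE users ADD COLUMN email TEXT",
}


def current_version(db):
    return db.execute("SELECT version FROM schema_version").fetchone()[0]


def migrate(db, target):
    """Apply every pending migration up to `target`; a no-op if already there."""
    for version in range(current_version(db) + 1, target + 1):
        db.execute(MIGRATIONS[version])
        db.execute("UPDATE schema_version SET version = ?", (version,))
        db.commit()


def old_code_insert(db, name):
    """Old release: names its columns explicitly, so an extra column is harmless."""
    db.execute("INSERT INTO users (name) VALUES (?)", (name,))
    db.commit()


def new_code_emails(db):
    """New release: must tolerate rows written by old code (email is NULL)."""
    return [row[0] for row in db.execute("SELECT email FROM users ORDER BY id")]


tenants = {name: make_tenant_db() for name in ("acme", "globex", "initech")}
for db in tenants.values():
    migrate(db, 2)
    old_code_insert(db, "alice")  # old code is still serving traffic mid-rollout
```

With thousands of databases the loop is the slow part, which is why every migration has to be safe for both the old and the new release at the same time.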

Microservices have a different approach: usually there's a single database for all tenants (a database per microservice, of course), what is sharded instead is tables. There's tables like "user_0", "user_1" et cetera; they are created automatically when needed. It allows some degree of isolation (although several tenants can occupy same sharded tables), but the main benefit is that scanning such tables is faster. The migration mechanism can enumerate all such sharded tables and apply migrations to them one by one. For data isolation, there's a requirement that each table must have an "accountID" column which must be always checked in each access to the repository on the infrastructure level (otherwise it shouldn't pass the code review). Account ID itself comes from the JWT token from the request headers, so a malicious tenant can't access other tenants by just changing the account ID in the request. Business logic doesn't pass account ID's around in function signatures, it happens transparently on the infrastructure level (it's passed to the repository constructor when building the service graph in the dependency container).
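For what it's worth, here is roughly how I picture the "account ID is bound at the infrastructure level" part. This is a guess at the shape, not the poster's code: `UserRepository`, `build_repository` and the table layout are hypothetical, and `verified_claims` stands in for the payload of a JWT whose signature has already been checked elsewhere.

```python
import sqlite3


class UserRepository:
    """Every query is scoped to the account ID given at construction time."""

    def __init__(self, db, account_id):
        self._db = db
        self._account_id = account_id

    def add(self, name):
        self._db.execute(
            "INSERT INTO users (accountID, name) VALUES (?, ?)",
            (self._account_id, name),
        )
        self._db.commit()

    def list_names(self):
        rows = self._db.execute(
            "SELECT name FROM users WHERE accountID = ? ORDER BY id",
            (self._account_id,),
        )
        return [row[0] for row in rows]


def build_repository(db, verified_claims):
    """Stand-in for the dependency container: wire the account ID in once.

    `verified_claims` is assumed to be the payload of an already-verified JWT.
    """
    return UserRepository(db, verified_claims["accountID"])


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, accountID TEXT NOT NULL, name TEXT NOT NULL)"
)

acme_users = build_repository(db, {"accountID": "acme"})
globex_users = build_repository(db, {"accountID": "globex"})
acme_users.add("alice")
globex_users.add("bob")
```

Business logic only ever sees `list_names()` and `add()`, so there is no account ID parameter for it to forget or get wrong.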

"This is not a big deal in Comments because our API is not public and user accounts are shared across all newsrooms."

This was kinda scary to read. Just because an API isn't documented doesn't mean it isn't public! Broken Access Control is the top OWASP issue in 2021: https://owasp.org/Top10/

Visiting the sample 'Comments' link and looking at the network console in Firefox revealed this call: https://cmmnts-api.i.bt.no/v1/publications/e24/article:e24:a... DESC&replies=ASC

Doesn't look like they protect it (probably because it is called from JavaScript). That means a malicious user can nose around and create mischief.

For multi-tenant architecture, the following topics are always top priority for me:

- Shared compute with tenant context passed via JWT

- Data isolation by either physical separation (i.e. separate database) or logical separation (i.e. separate schema, table, or column association) depending on requirements

- Enforcing tenant context at the API gateway

- Always leveraging policies and ACLs via JWT to enforce secure data retrieval

- Sometimes using RLS within the database

- Either universal data encryption or per tenant depending on requirements

- JWT is fine and webscale but plain sessions are also fine. Associating logins with tenants is the important bit.

- Shared compute is actually the part that to me means diddly squat and customers seem to prefer dedicated. It costs nothing to spin up more stateless-ish app servers dedicated to a tenant. It’s the db, logs, caching, load balancers, queues, monitoring I don’t want to split up. Also nothing is still wrong with normal sessions stores in Redis.

- Separate schemas are not preferred but fine kinda but at the very least don’t create separate db accounts per tenant. The credential/connection management will make your life a living nightmare and doesn’t work with SQL proxies.

- We must seriously have vastly different JWT experiences. Every super businessy app I’ve made hits the ceiling fast of how much junk you can store in the JWT before having to punt to the db for user permissions.

- RLS is dope and you should choose it every time when you can. Not having to do #customers schema migrations is worth it.

>- Data isolation by either physical separation (i.e. separate database) or logical separation (i.e. separate schema, table, or column association) depending on requirements

It's interesting to me that you and @abraae seem to take the exact opposite view on the topic of data isolation where he/she has a much... harsher opinion:

>We took the decision to use separate databases (schemas in the Oracle world) ... I now think of this approach as maybe getting into heroin (no direct experience). Feels great, powerful, you can do anything. But ultimately it will kill you.

Of course this doesn't apply to the case of column association, but I'm interested in your take on this.


If your app is super-sensitive and absolute security is an absolute must then sure, using a separate database for every customer will help you answer those pesky security questions "How do you ensure that resources are not accessed by other tenants?".

My experience though is that if you are after velocity, and ease of maintenance, then you need a single database and tables with a "tenant id" column in them.

Even simple things get hard when you have separate databases. Say I want to know how many customers have how many widgets on average. If everyone's in one database, that's a SQL query. If they're all separate, it's ten times harder to answer.
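A small illustration of the widget question, with SQLite standing in for whatever database you actually run. The `widgets` table and tenant names are invented. The first half is the shared-table case; the second half is the same answer when every tenant has its own database.

```python
import sqlite3

DATA = {"acme": ["a", "b", "c"], "globex": ["d"]}

# Shared database: one table with a tenant_id column, one query.
shared = sqlite3.connect(":memory:")
shared.execute(
    "CREATE TABLE widgets (id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, name TEXT NOT NULL)"
)
shared.executemany(
    "INSERT INTO widgets (tenant_id, name) VALUES (?, ?)",
    [(tenant, name) for tenant, names in DATA.items() for name in names],
)
# Note: tenants with zero widgets have no rows here and are not counted.
shared_avg = shared.execute(
    "SELECT AVG(n) FROM (SELECT COUNT(*) AS n FROM widgets GROUP BY tenant_id)"
).fetchone()[0]

# Separate databases: connect to each one, query it, then combine in code.
per_tenant = {}
for tenant, names in DATA.items():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE widgets (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    db.executemany("INSERT INTO widgets (name) VALUES (?)", [(n,) for n in names])
    per_tenant[tenant] = db

counts = [
    db.execute("SELECT COUNT(*) FROM widgets").fetchone()[0]
    for db in per_tenant.values()
]
separate_avg = sum(counts) / len(counts)
```

Two in-memory databases make the loop look harmless. With thousands of tenants it turns into connection management, partial failures and a reporting pipeline.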

This isn’t possible if you’re expecting all transactions to be in a single table. At some point you’re going to have to shard things, making any query at scale more complex.

Can you clarify? You can vertically scale a database a few orders of magnitude. My view is that “some point” is a long way off for most OLTP workloads.

> - Sometimes using RLS within the database

When is it not a good idea to leverage the database's RLS for access control?

It pretty much always is, but people are very wary of doing anything directly in the database these days, even stuff that's security critical and should apply to every query.

I mean it’s not super common, people usually opt for separate servers/schemas first. I’ve only been at one shop that’s actually done multi-tenant with RLS.

How did it turn out for them?

The AWS Well-Architected SaaS Lens covers different strategies nicely (the concepts are of course independent of AWS): https://docs.aws.amazon.com/wellarchitected/latest/saas-lens...

I wonder how well row level security policies really work. There are many interesting articles in the AWS documentation for this e.g. "Multi-tenant data isolation with PostgreSQL Row Level Security" https://aws.amazon.com/de/blogs/database/multi-tenant-data-i...

Multi-tenant databases feel like the result of a decision where a team either doesn't know how to architect data models well or doesn't want to put the effort into doing so. Referential integrity was solved 50 years ago. To demand that one's data not be commingled with someone else's in a database is as arbitrary as demanding it not be transmitted by the same pool of network connections used by the server. Our data must not reside within the same physical disk storage or memory as that used by other customers!

To turn this practical: A good security model makes the right thing happen by default, and makes doing the wrong thing hard.

A secure data model should make the tenant identifier necessary to successfully complete a query. Haven’t composite keys (including composite primary keys) been around since SQL86?
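One way to read that, sketched with SQLite: put the tenant identifier into every primary key and every foreign key, so a row cannot point at another tenant's data even if application code slips. The `users` and `comments` tables are made up for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
db.execute(
    """
    CREATE TABLE users (
        tenant_id TEXT NOT NULL,
        user_id   INTEGER NOT NULL,
        name      TEXT NOT NULL,
        PRIMARY KEY (tenant_id, user_id)
    )
    """
)
db.execute(
    """
    CREATE TABLE comments (
        tenant_id  TEXT NOT NULL,
        comment_id INTEGER NOT NULL,
        user_id    INTEGER NOT NULL,
        body       TEXT NOT NULL,
        PRIMARY KEY (tenant_id, comment_id),
        -- The tenant is part of the reference, so it has to match.
        FOREIGN KEY (tenant_id, user_id) REFERENCES users (tenant_id, user_id)
    )
    """
)
db.execute("INSERT INTO users VALUES ('acme', 1, 'alice')")
db.commit()


def try_insert_comment(tenant_id, comment_id, user_id, body):
    """Return True if the database accepted the row, False if a constraint refused it."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO comments VALUES (?, ?, ?, ?)",
                (tenant_id, comment_id, user_id, body),
            )
        return True
    except sqlite3.IntegrityError:
        return False


same_tenant = try_insert_comment("acme", 1, 1, "hello")
cross_tenant = try_insert_comment("globex", 1, 1, "pointing at acme's user 1")
```

Here `same_tenant` succeeds and `cross_tenant` is rejected, because there is no user `('globex', 1)` for the comment to reference.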

A good application layer, likewise, enforces that a tenant identifier is set on every endpoint, with no additional code, just from creating the endpoint.
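A toy version of that idea with no framework involved: the only way to register an endpoint is through a decorator that resolves the tenant first, so a handler cannot exist without one. `endpoint`, `dispatch`, the request dict and the widget data are all invented for the sketch, and the tenant is assumed to have been put on the request by authentication upstream.

```python
from functools import wraps

ROUTES = {}
WIDGETS = {"acme": ["a", "b"], "globex": ["c"]}  # stand-in data store


def endpoint(path):
    """Register a handler; the tenant check comes with registration."""

    def register(handler):
        @wraps(handler)
        def wrapped(request):
            # Assumed to be set by authentication upstream, never by the caller.
            tenant_id = request.get("tenant_id")
            if not tenant_id:
                return {"status": 403, "body": "missing tenant"}
            return {"status": 200, "body": handler(tenant_id, request)}

        ROUTES[path] = wrapped
        return wrapped

    return register


@endpoint("/widgets")
def list_widgets(tenant_id, request):
    # The handler receives the tenant as an argument; it cannot skip it.
    return WIDGETS.get(tenant_id, [])


def dispatch(path, request):
    return ROUTES[path](request)
```

Writing a new endpoint takes no extra tenant code, and leaving the check out would mean bypassing the decorator on purpose.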

The issue is backups.

We're a B2B/Enterprise SaaS and most tenants require that we erase all their data at the end of the contract. Some require customer-managed encryption keys. The only way to meet this requirement is to have every tenant isolated in their own database (and their own S3 bucket etc). If data is mixed, when one tenant leaves you must go through all copies of all backups, purge their rows, then re-save the cleaned up backups. Nearly impossible in practice.

Crypto-shredding is not an option? https://en.wikipedia.org/wiki/Crypto-shredding

In practice, I'm not sure where to begin with that strategy if you have a single .sql dump file with ~10 tables containing sensitive data from 50 different tenants. Even if theoretically possible, I'll take 50 files like tenantN.sql for both saving and restoring sanity.

I think it depends. The concept of a tenant is a spectrum that ranges from simple end user to big company with thousands of users. Depending on what your application provides, different solutions are useful.

Discriminate tenants by Id in the database? This can go wrong for countless reasons. I would use a separate DB for each tenant.

I would tend to agree but maybe it's not a fit for them.

I'd say in this case the obvious problem is the comment is associated with a tenant. The comment should be associated with a user which should (probably) be associated with exactly one tenant.

This is a great reminder of all the problems I don’t think about because I use Lambda, API Gateway and DynamoDB. I still think a lot about IAM policies and queuing systems, but scaling compute and databases are no longer a concern.

Do you not think about relational integrity?

I do, but the techniques are different. Adjacency lists are the primary tool. But there is generally a lot less normalisation. Keep related data in the same item or next to each other in the list. Sometimes you over-fetch and filter for the data you need client-side.
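For anyone who hasn't seen the pattern, a plain-Python stand-in (no AWS calls, no SDK) for how related items sit next to each other under one partition key and get fetched with a sort-key prefix. The key formats and the `query` helper are assumptions, loosely modelled on a DynamoDB Query with a `begins_with` condition on the sort key.

```python
# In-memory stand-in for a single table; pk = partition key, sk = sort key.
TABLE = [
    {"pk": "TENANT#acme", "sk": "META", "plan": "enterprise"},
    {"pk": "TENANT#acme", "sk": "USER#bob", "role": "viewer"},
    {"pk": "TENANT#acme", "sk": "USER#alice", "role": "admin"},
    {"pk": "TENANT#globex", "sk": "USER#carol", "role": "admin"},
]


def query(pk, sk_prefix=""):
    """Return one partition's items, ordered by sort key, optionally by prefix."""
    items = [i for i in TABLE if i["pk"] == pk and i["sk"].startswith(sk_prefix)]
    return sorted(items, key=lambda item: item["sk"])


def admins(tenant):
    """Over-fetch every user of the tenant, then filter client-side."""
    users = query(f"TENANT#{tenant}", "USER#")
    return [u["sk"].split("#", 1)[1] for u in users if u["role"] == "admin"]
```

There is no join anywhere: the relationship between a tenant and its users is simply that they share a partition key.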

For DynamoDB you want as little relational data as possible, everything must be 'denormalized' by default. Joins will kill you in these (nosql) databases.

Denormalized doesn't mean it's not relational.
