
Our Multi-Tenancy Journey with Postgres Schemas and Apartment - pwmtr
https://influitive.io/our-multi-tenancy-journey-with-postgres-schemas-and-apartment-6ecda151a21f#.vfxsvbsrg
======
Walkman
If you are interested in the topic of multi-tenancy, this Microsoft article is
the best I ever found: [https://msdn.microsoft.com/en-
us/library/aa479086.aspx](https://msdn.microsoft.com/en-
us/library/aa479086.aspx)

~~~
why-el
I completely agree. My first professional job dealt with multi-tenancy, and I
kept a printed copy of this article on my desk for months. Extremely well
written.

~~~
infogulch
If you go with the separate-database model, how do you manage database
deployments and keep them in sync with application deployments?

~~~
Walkman
Example 1: You can stop all instances, wait for the sync/migration to
complete, and start all instances.

Example 2 (which is much better IMO): deploy the new version, but keep the
older versions running. Sync/migrate tenants one by one and start the new
version when ready.

------
jacobkg
Everyone's experience is of course their own, but I wanted to provide a
counterpoint from my experience over the past 5 years. At NationBuilder we do
schema-based multi-tenancy with Postgres. We have over 30,000 schemas spread
across over a dozen databases. One of our databases has more than 5,000
customer schemas on it (the average is about 2,500). Nevertheless we are able
to run a Rails application, execute migrations, generate new schemas using the
Schema.rb file, add and remove indexes, and many of the other items mentioned
in the article.

Has it been challenging? Absolutely. But it definitely is possible.
Furthermore we have not found a better way to ensure the privacy of our
customer data (separate schemas is the closest we can get to guaranteeing that
a badly written join won't expose data to the wrong customer).

PS - I applaud the authors of the original article for publishing their schema
management code as the open source Apartment gem (which I wish had existed
when we were starting out).
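
To make the "execute migrations across 30,000 schemas" part concrete, here is a minimal sketch of how a migration can be fanned out across tenant schemas, similar in spirit to what the Apartment gem does. The schema names, the DDL, and the helper function are all invented for illustration; a real runner would execute these statements against Postgres.

```python
# Hypothetical sketch: fan one migration out across many tenant schemas
# by switching search_path before each run. Names here are made up.

def per_schema_statements(schemas, ddl):
    """Prefix each run of the migration DDL with a search_path switch so
    it applies inside one tenant schema at a time."""
    statements = []
    for schema in schemas:
        # Double-quote the identifier (and escape embedded quotes) so odd
        # schema names cannot break out of the SET statement.
        safe = schema.replace('"', '""')
        statements.append('SET search_path TO "%s";' % safe)
        statements.append(ddl)
    return statements

stmts = per_schema_statements(
    ["tenant_a", "tenant_b"],
    "ALTER TABLE orders ADD COLUMN notes text;",
)
for s in stmts:
    print(s)
```

The loop is embarrassingly parallel, which is one reason this scales to thousands of schemas even though each migration runs N times.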

~~~
barrkel
We do SaaS for the financial industry, and we started out creating a separate
virtual network with multiple VMs for each customer. Router firewalling,
iptable lockdown, IP whitelisting to customer IP range. Encrypted disks
(backed by SAN) for database storage with different keys per customer. Etc.

So we had to scale up our deployment process to multiple concurrent migrations
per customer pretty much from day one.

The downside of this is that customers need to be big to afford the constant
overhead of physically renting the machinery to serve them, administer them
etc. So we're moving towards multi-tenancy from the other angle, aiming at
smaller firms with less stringent data security requirements.

------
zwily
We use a similar scheme. The biggest driver for us is being able to move
tenants around between databases to balance load as our tenants grow.

~~~
jacobkg
+1 We are continuously rebalancing customers between database servers. An
added benefit is that doing a pg_dump/pg_restore removes all accumulated
database bloat, which decreases size on disk and improves performance.
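
For anyone unfamiliar with how a single schema gets moved this way, here is a sketch of the pg_dump/pg_restore invocations involved. The `-n`, `-Fc`, `-h`, `-d`, and `-f` flags are real pg_dump/pg_restore options; the host, database, and schema names are made up, and the sketch only builds the command lines rather than executing them.

```python
# Hypothetical sketch: move one tenant schema between database servers.
# -n limits the dump to a single schema; -Fc writes a custom-format
# archive that pg_restore can read.

def dump_schema_cmd(schema, host, dbname):
    return ["pg_dump", "-h", host, "-d", dbname,
            "-n", schema, "-Fc", "-f", schema + ".dump"]

def restore_schema_cmd(schema, host, dbname):
    return ["pg_restore", "-h", host, "-d", dbname, schema + ".dump"]

print(" ".join(dump_schema_cmd("tenant_42", "db-old", "app")))
print(" ".join(restore_schema_cmd("tenant_42", "db-new", "app")))
```

Because the dump contains exactly one tenant's objects, the restore also rewrites the tables from scratch, which is where the bloat removal comes from.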

~~~
abishekk92
How long does such restructuring usually take? Also, are you able to do this
with minimal downtime? I have a similar use case; however, the export and
re-import usually take quite a long time.

~~~
zwily
We use bucardo to move a schema to a different host with no downtime.

------
tener
Perhaps I'm missing something, but to me the obvious way to serve multiple
clients is to simply create multiple isolated deployments, including a
separate database server for every client.

Handling this on the application level feels like unnecessary complexity.

Any good reasons to go with the approach described in the blog post?

~~~
arethuza
What if you want to support a very large number of relatively small tenants -
having a separate server for each one of those could well be overkill.

~~~
Walkman
Not at all. You can do it in multiple ways, e.g. with Docker.

~~~
angersock
Docker is not a silver bullet for this; look at the other posts talking about
ops, upgrades, and performance for the price.

------
Todd
One advantage of storing multiple schemas in one database is that a connection
can be shared (assuming a shared authentication strategy). This is because, as
mentioned in the article, the switch to one schema vs. another happens after
the connection. This can make it easier to pool connections and to share
servers across tenants. A tenant_id approach gives the same benefit without
the isolation, but a separate database (even on the same server) doesn't.

Edit: typo
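
A minimal sketch of the point above, assuming Postgres-style `SET search_path`: one pooled physical connection can serve any tenant because the schema switch happens per checkout, after the connection already exists. The pool, tenant names, and `FakeConnection` class are invented; `execute()` just records SQL so the sketch runs without a database.

```python
# Hypothetical sketch: per-checkout schema switching on a shared connection.

class FakeConnection:
    """Stand-in for a real DB connection; records executed SQL."""
    def __init__(self):
        self.executed = []

    def execute(self, sql):
        self.executed.append(sql)

def checkout_for_tenant(conn, tenant):
    # Same physical connection, re-pointed at a tenant's schema.
    safe = tenant.replace('"', '""')
    conn.execute('SET search_path TO "%s", public;' % safe)
    return conn

conn = FakeConnection()  # imagine this came from a shared pool
checkout_for_tenant(conn, "tenant_a").execute("SELECT * FROM orders;")
checkout_for_tenant(conn, "tenant_b").execute("SELECT * FROM orders;")
print(conn.executed)
```

With separate databases, by contrast, each tenant needs its own connection, so the pool fragments per tenant.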

------
chrismorgan
How about row-level security? As I understand it—without having actually
_used_ it—it should be able to get the same level of isolation as schemas (so
long as you stay away from incrementing, user-exposed IDs), simply making all
the other tenants’ data invisible.
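
For reference, the Postgres row-level security setup being asked about looks roughly like the statements below. The policy syntax (`ALTER TABLE ... ENABLE ROW LEVEL SECURITY`, `CREATE POLICY ... USING (...)`) is real Postgres; the table name, policy name, and the `app.tenant_id` session setting are made-up examples. The sketch only generates the DDL strings so it runs standalone.

```python
# Hypothetical sketch of a tenant-isolation RLS setup for one table.

def rls_statements(table):
    return [
        "ALTER TABLE %s ENABLE ROW LEVEL SECURITY;" % table,
        # current_setting('app.tenant_id') reads a per-session variable
        # the application would SET right after connecting.
        "CREATE POLICY tenant_isolation ON %s "
        "USING (tenant_id = current_setting('app.tenant_id')::int);" % table,
    ]

for stmt in rls_statements("orders"):
    print(stmt)
```

Once the policy is in place, queries against `orders` only see rows matching the session's tenant, with no change to application SQL.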

~~~
15155
RLS doesn't allow for one very important administrative feature:

Data migration/export

Schemas allow an admin to extract an exact backup of one user's data without
any other users' data; RLS is a little less flexible in this regard.

------
stcredzero
_I don’t know exactly how to represent that in Big O, but I think it would
look something like O(WTFxN)._

If WTF is a constant, then O(WTFxN) == O(kN) == O(N)

~~~
vog
I don't think the author meant "WTF" to be a constant. It was defined to be
whatever complexity index creation has.

However, I still disagree with "O(WTF*N)", because "WTF" almost certainly
depends on the number of rows in the table (large tables take longer to index
than small tables), and that number is lower in the multi-tenancy approach,
because the individual tables are smaller.
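
Some rough arithmetic behind that point, assuming index creation costs about O(n log n) in the number of rows (the row counts are invented for illustration): building one small index per tenant is cheaper in total than one index over a single giant shared table.

```python
# Hypothetical cost comparison: one big index vs. many per-tenant indexes,
# modeling index creation as n * log2(n).
import math

def index_cost(rows):
    return rows * math.log2(rows)

total_rows = 1_000_000
tenants = 1_000
rows_per_tenant = total_rows // tenants

monolithic = index_cost(total_rows)                 # one shared table
per_tenant = tenants * index_cost(rows_per_tenant)  # 1,000 small tables

# The split total is exactly half here, since log2(10^6) = 2 * log2(10^3).
print(monolithic, per_tenant)
```

The gap grows with table size, which is why "WTF" shrinks as the data is split across tenants.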

------
Walkman
In my experience there is no point developing multi-tenant software nowadays,
because operating a large number of isolated environments is a solved problem
and really cheap. You can buy a VPS for less than $5 today, create them
automatically, and deploy the instances easily. You can scale either
horizontally or vertically (Docker can help with both). It's not much more
work, but it's much more secure, more scalable, and almost as cheap as a
multi-tenant solution would be.

~~~
rpedela
On the surface, you are right. However, you often still need custom code to
handle the business logic. If a new customer joins, you have to spin up a new
instance. If a customer leaves, you have to delete the instance. How do you
automate backup and restore? Is there a central repo for record keeping, and
how does it get updated?

Sometimes multi-tenancy is a simpler solution because no instance fleet
management code needs to be written. With everything there are downsides to
multi-tenancy too so you have to weigh the pros and cons for the specific
project.

~~~
Walkman
You are right. Without multi-tenancy, you have to write the automation code;
however, that's much better IMO, because with multi-tenancy you have to think
about tenancy ALL THE TIME, not just that one time. So I think overall it's a
better trade-off, with less code, and the problems left to solve are fewer
and easier.

------
ibejoeb
I'd like to offer an alternative option to those currently confronting this
topic. Use a database that explicitly supports this kind of thing, such as SQL
Server RLS or Oracle RAS/VPD. (I'll note that PostgreSQL can do very similar
things rather easily, too, but it might provide less of a benefit under larger
loads. I'll write that up separately if there's interest.)

These features work with multi-tenant tables (e.g., tenant_id everywhere) by
allowing the application to initialize a predicate that is automatically
applied to every query executed.

For example, the app executes "select * from orders where quantity > 10" for
tenant 10, and the database resolves it to "select * from orders where
tenant_id = 10 and quantity > 10".
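
A toy version of that query rewriting, to make the mechanism concrete: the application sets a tenant once, and every query gets the tenant predicate appended before it reaches the data. Real RLS/VPD does this inside the engine with proper SQL parsing; this standalone sketch just manipulates the SQL string and handles only the simplest shapes.

```python
# Hypothetical sketch of automatic tenant-predicate injection.

def with_tenant_predicate(sql, tenant_id):
    """Naively splice 'tenant_id = N' into a query's WHERE clause
    (or append one if the query has none)."""
    lowered = sql.lower()
    if " where " in lowered:
        # Insert the tenant predicate right after WHERE, AND-ed with
        # whatever the application asked for.
        idx = lowered.index(" where ") + len(" where ")
        return sql[:idx] + "tenant_id = %d and " % tenant_id + sql[idx:]
    return sql + " where tenant_id = %d" % tenant_id

q = with_tenant_predicate("select * from orders where quantity > 10", 10)
print(q)  # select * from orders where tenant_id = 10 and quantity > 10
```

The database-enforced versions are strictly better than doing this in the application, because they cannot be bypassed by a forgotten predicate.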

With this configuration, you get both the isolation afforded by schemas and
the uniformity (think DRY) of the data definitions. You're protected from
leaky queries and you can leverage the performance features.

Here are a few considerations:

Recognizing that, to the database, objects schema1.sometable and
schema2.sometable are different, it becomes clear that large schema-based
tenancy is quite a bit more work for the database. Each statement must be
compiled and executed separately, so if "select foo from bar" is one plan,
then for n schemas, you have n plans for the same thing. Those plans must
exist in memory. In the NationBuilder case in this thread, where the average
number of schemas per node is 2,500, that's a crazy amount of additional
overhead. Plans will be evicted, and you're going to be burning CPU
recompiling statements and regenerating plans.

Furthermore, you're going to be facing connection management issues and the
potential leakiness there. Improper application of connection pooling can just
as easily expose cross-schema data as cross-row data. In a large system, just
the sheer number of connections may become burdensome, so connection sharing
will be important in reducing degradation from setup/teardown. Most vendors
offer fancy connection striping for smartly routing queries to appropriately
initialized application sessions.

If you're growing and you need to cope with a lot of data, you can now begin
to employ more advanced database features. For example, all vendors offer some
kind of data partitioning and parallel execution. This is going to let you
focus more on the physical structure of the data and arrange it in ways that
naturally support the kinds of queries you need to perform.

Likewise, it will simplify unloading and loading to analytical systems. It's
pretty easy to extract this month's orders from one orders table (even if that
table is physically all over the place and managed by some fancy storage
management feature), copy the file, and load it into Vertica. You're going to
get into building some novel system to do it for dynamic, distributed schemas.

Following that, it might be economically compelling. License cost might be
less than engineering cost. It probably will be. Some of the features I've
mentioned are separately licensed under some vendor models, and that's an
additional expense, but it's also one that can be deferred until needed. It's
really worth investigating whether you should be spending this much
engineering effort on non-line-of-business matters, because it very well could
cost much more than implementing an existing solution.

