Postgresql – How to model parent-child relationship correctly in multi-tenant, multi-schema Postgres database

database-designpostgresql

I have a multi-tenant SaaS app where each tenant gets their own schema. When a user creates an account, I create a new schema and run a bunch of migrations to create tables under that schema. Then, when the user wants to authenticate, they provide their tenant info (this can be their subdomain, e.g. mycompany.coolapp.com, or it can be a unique account ID), and the app routes subsequent queries to that tenant's schema.

However, certain pieces of tenants' data will be public. For instance, the user will need to create invoices and email their links (containing a UUID) to their customers. When the customer clicks the link, the app needs to be able to figure out which tenant the invoice belongs to, based on the UUID.

To do this, I created a "public.invoices" table, with two columns: uuid and tenant. The latter contains the tenant's schema name, which I can then use to perform the subsequent queries (i.e. the invoice's actual data) to the appropriate schema.

Question 1: Is this design sound? Is there some other, more standard way of routing unauthenticated/pre-authenticated requests to specific schemas in situations where each tenant gets their own schema?

Question 2: How do I tell Postgres that, when the tenant deletes the invoice (the one in their schema), the public record (under "public.invoices") also should be deleted? I'm trying to avoid using two queries, if some database-provided option or feature exists. I looked at CASCADE, but it seems that only goes from parent to child, and as far as I can tell there isn't a way to define a foreign key relationship in this scenario where the tenant invoice is the parent record.

Best Answer

There are many ways to implement this, but let me suggest partitioning.

The table public.invoice would then be list partitioned, with the tenant being the partitioning key. Each partition sits in the schema of the respective tenant.

That way everybody can query their own invoices in their own schema or all invoices in the public schema.

The big advantage of partitioning here is that it becomes very easy to remove a tenant: simple detach the respective partition and drop it.

Note that partitioning becomes impractical with thousands of partitions. But if you plan to have that many tenants, it is a bad idea to implement that with thousands of schemas with identical tables in a single database. Having too many tables will slow down metadata operations, and it makes life harder for autovacuum (which in turn will make your life hard).

Related Solutions

Sql-server – Handling growing number of Tenants in Multi-tenant Database Architecture

At the lower end (500 tenants / 10000 users) this is how I did it. First, you have a "control" database that is global, central and contains all of the information about tenants and users (I really don't think you want to manage these as SQL auth logins). So imagine a database called "Control" with the following tables:

CREATE TABLE dbo.Instances
(
  InstanceID INT PRIMARY KEY,
  Connection VARCHAR(255)
  --, ...
);

INSERT dbo.Instances SELECT 1, 'PROD1\Instance1';
INSERT dbo.Instances SELECT 1, 'PROD2\Instance1';
-- ...

CREATE TABLE dbo.Tenants
(
  TenantID INT PRIMARY KEY,
  Name NVARCHAR(255) NOT NULL UNIQUE,
  InstanceID INT -- Foreign key tells which instance this tenant's DB is on
  --, ...
);

INSERT dbo.Tenants SELECT 1, 'MyTenant', 1;
-- ...

CREATE TABLE dbo.Users
(
  UserID INT PRIMARY KEY,
  Username VARCHAR(320) NOT NULL UNIQUE,
  PasswordHash VARBINARY(64), -- because you never store plain text, right?
  TenantID INT -- foreign key
  --, ...
);

INSERT dbo.Users SELECT 1, 'foo@bar.com', 0x43..., 1;

In our case when we added a new tenant we would build the database dynamically, but not when the admin user clicked OK in the UI... we had a background job that pulled new databases off a queue every 5 minutes, set model to single_user, and then created each new database serially. We did this to (a) prevent the admin user from waiting for database creation and (b) to avoid two admin users trying to create a database at the same time or otherwise getting denied the ability to lock model (required when creating a new database).

Databases were created with the name scheme Tenant000000xx where xx represented Tenants.TenantID. This made maintenance jobs quite easy, instead of having all kinds of databases named BurgerKing, McDonalds, KFC etc. Not that we were in fast food, just using that as an example.

The reason we didn't pre-allocate thousands of databases as the comment suggested is that our admin users usually had some idea of how big the tenant would become, whether they were high priority, etc. So they had basic choices in the UI that would dictate their initial size and autogrowth settings, which disk subsystem their data/log files would go to, their recovery settings, backup schedule to hinge off of, and even smarts about which instance to deploy the database to in order to best balance usage (though our admins could override this). Once the database is created, the tenant table was updated with the chosen instance, an admin user was created for the tenant, and our admins were e-mailed the credentials to pass along to the new tenant.

If you're using a single point of entry, it is not feasible to allow multiple tenants to have users with the same username. We opted to use e-mail address, which - if all users work for the company and use their corporate e-mail address - should be fine. Though our solution eventually became more complex for two reasons:

We had consultants that worked for more than one of our clients, and needed access to multiple
We had tenants who themselves were actually comprised of multiple tenants

So, we ended up with a TenantUsers table that allowed one user to be associated with multiple tenants.

Initially when a user logs in, the app will know the connection string for the control database only. When a login is successful, it can then build a connection string based on the information it found. E.g.

SELECT i.Connection
  FROM dbo.Instances AS i
  INNER JOIN dbo.Tenants AS t
  ON i.InstanceID = t.InstanceID
  INNER JOIN dbo.TenantUsers AS u
  ON i.TenantID = u.TenantID
  WHERE u.UserID = @UserID;

Now the app could connect to the user's database (each user had a default tenant) or the user could select from any of the tenants they could access. The app would then simply retrieve the new connection string, and redirect to the home page for that tenant.

If you get into this 10MM user area you propose, you'll definitely need this to be balanced better. You may want to federate the application so that they have different points of entry connecting to different control databases. If you give each tenant a subdomain (e.g. TenantName.YourApplicationDomain.com) then you can do this behind the scenes with DNS/routing without interrupting them when you need to scale out further.

There is a lot more to this - like @Darin I am only scratching the surface here. Let me know if you need a non-free consult. :-)

Row Level security or Multiple Tables

My bias is to use a single table with appropriate row-level security.

There are potentially huge maintenance advantages to a single set of tables. If you end up with n copies of each table, that means that you have to run n copies of each script every time you want to make a change. Frequently, that means that you end up with at least a few very slightly different versions of the application running at a time because someone forgot to apply script 7 of 23 in a monthly build to one set of tables and someone else created an index on one set of tables to address one customer's issues without adding it to every customer which makes debugging much harder.

A single set of tables has significant scalability advantages. Adding new customers just requires adding a new row to the customer table not deploying a new schema/ database with the new customer's copies of the tables. Adding new customers also doesn't directly add ongoing work for the DBAs. If you have separate copies of the tables, you need someone to deploy tables to create a new customer and every new customer means additional work for the DBA at least to run the scripts one more time every time there is a change.

A single set of tables may also offer performance advantages. If you're using a separate set of tables, each customer would realistically need a separate connection pool in the middle tier. It would defeat the purpose of having separate tables, after all, if your middle tier is connecting as a user that can see every tenant's data because then you'd be implementing row-level security in the middle tier and dealing with all the complexity of multiple sets of tables in the back end. That makes it tough to scale across servers-- do you create a connection pool for every client on every server? Do you send certain clients to certain servers? Do you not preallocate connections and incur the cost of waiting for connections to be established more frequently?

That being said, there are cases where separate tables might be preferred. If your customers are frequently large institutions, for example, separate tables will make it much easier to do things like move a customer to a dedicated box (or at least a dedicated VM) if the customer wants to upgrade to dedicated hardware so that they don't risk performance being affected by other customers. Those large institutions may want greater control over outages and upgrades so it may make sense to have separate tables to allow different customers to be upgraded at different times in order to work with that customer's schedule. Those institutions may find it easier to tell an auditor that their data is physically separate from all other customers rather than explaining that the data is physically intermingled but security controls are in place to guarantee row-level security. If you're going to have relatively few relatively large customers, the amount of maintenance overhead you introduce by having separate tables may not be particularly significant given the general day-to-day maintenance tasks that each client likely requires. In that sort of environment, different clients often have sufficiently different configurations that the problems they encounter are relatively unique even when they're running exactly the same version of the software.

Best Answer

Related Solutions

Sql-server – Handling growing number of Tenants in Multi-tenant Database Architecture

Row Level security or Multiple Tables

Related Question