Postgresql – Designing a multi-tenant database for a scenario with multiple user types

database-designdjangomulti-tenantpostgresqlschema

I'm developing a SaaS application for recruiting which has multiple user types such as Client, Recruiter, Panellist and Owner etc. There is also a chance that the user types may increase.

The Owner has an Organization and this is what will be my tenant. Here I'm considering every owner's organization as a tenant. I'm using Django to develop this application and will be utilizing the schemas feature provided in the PostgreSQL, hence there will be a schema for every tenant.

What I'm planning to do is that there will be a global User table that will be kept in the public schema and stores generic user information such a first name, last name, email, password, etc. The User table will be used for logging into the site. And there will be additional tables for each of the respective user types that each have their own unique set of columns. These additional tables will be tenant/organization specific.

Now once the user logs into to site, he can then switch to his desired tenant/organization something similar to how workspaces function in Slack. Within the tenant/organization, the user can assume the role of the above-mentioned user types that were allotted by the Owner to the user. So the user can be a Recruiter in one tenant and he can also be a Panellist in another tenant.

There will be an UserOrganization junction table which will keep track of the fact that a User can belong to multiple Organization.

Here is a schema diagram of what I'm thinking:

NOTE: I did not make the schema connections for the tables in the second tenant because it was starting to get messy but for all intents and purposes assume that they are present over there.

My questions are:

What is exactly the best practice when dealing with multiple users in a multi-tenant scenario?
What are the deficiencies of my design?

Best Answer

Using multiple schemas to implement a multi-tenant scenario is a feasible solution if you have few tenants. If you have too many tenants, you will end up with too many tables, and that will make life quite painful for you.

If you have many tenants, you could use one table for all tenants and use a column like tenant_id to identify them. If you don't want to trust to your application logic to restrict query results to the appropriate tenant, you could use row level security and let the database do it for you.

Another useful tool for the “single table” approach is partitioning. You can use list partitioning to split that large table in partitions based on the tenants. That way, you can bundle several (smaller) tenants together in a single partition. Partitioning will make it easier to delete all data of a single tenant (this is its main advantage), and it may save you an index on the tenant_id. Most queries won't get faster through partitioning though.

Related Solutions

Sql-server – Is this schema usable for multi-tenant clients?How to remove multi-cascade paths

To achieve multi-tennancy you should limit the number of relationships you have to the factor that determines the tennant.

For example, each of Course, Department and Staff have a foreign key to School. However, these child tables are also interrelated. Is it necessary for Staff to have a schoolid? You can get to the school a staff member works for by following the foreign key to Course.

Similarly, you have grand-child tables (SD and DSC) which are relating tables in curious ways. What has "months indicating duration of the course" to do with which staff member is teaching the course? Is it really necessary for DSC to relate to Staff? Also, you appear to have repeating groups defined in your SD and SDC tables. I believe you may have a normalization issue.

By over-relating tables you create cascade cycles. Consider the following ERD as an alternative:

ERD

Here there is only one potential cycle to be concerned with, instead of the several that are in your current model. Fewer potential cycles means less extra work that might need to be done with triggers instead of declarative referential constraints. This keeps your system simpler and more maintainable.

Sql-server – Composite Primary Key in multi-tenant SQL Server database

Having worked on a large-scale, multi-tenant system (federated approach with customers spread across 18+ servers, each server having identical schema, just different customers, and thousands of transactions per second per each server), I can say:

There are some folks (a few, at least) who will agree on your choice of GUID as the IDs for both "TenantID" and whatever entity "ID". But no, not a good choice. All other considerations aside, that choice alone will hurt in a few ways: fragmentation to start with, vast amounts of wasted space (don't say disk is cheap when thinking about enterprise storage — SAN — or queries taking longer due to each data page holding fewer rows than it could with either INT or BIGINT even), more difficult support and maintenance, etc. GUIDs are great for portability. Is the data generated in some system and then transferred to another? If not, then switch to a more compact data type (e.g. TINYINT, SMALLINT, INT, or even BIGINT), and increment sequentially via IDENTITY or SEQUENCE.
With item 1 out of the way, you really do need to have the TenantID field in EVERY table that has user data. That way you can filter anything without needing an extra JOIN. This also means that ALL queries against client-data tables are required to have the TenantID in the JOIN condition and/or WHERE clause. This also helps guarantee that you don't accidentally mix data from different customers, or show Tenant A data from Tenant B.
I'm thinking to add TenantId as primary key along with Id. And possibly add OutletId too. So primary key(s) in sales order table will be Id, TenantId, OutletId.

Yes, you should have your clustered indexes on the client-data tables be composite keys, inclusive of TenantID and ID ^**. This also ensures that TenantID is in every NonClustered index (since those include the Clustered Index Key(s)) which you would need anyway since 98.45% of queries against client-data tables will need the TenantID (the main exception is when garbage collecting old data based on CreatedDate and not caring about TenantID).

No, you would not include FKs such as OutletID to the PK. The PK needs to uniquely identify the row, and adding in FKs would not help with that. In fact, it would increase chances for duplicated data, assuming that OrderID was unique for each TenantID, as opposed to unique per each OutletID within each TenantID.

Also, it is not necessary to add OutletID to the PK in order to ensure that Outlets from Tenant A don't get mixed up with Tenant B. Since all user-data tables will have TenantID in the PK, that means TenantID will also be in the FKs. For example, the Outlet table has a PK of (TenantID, OutletID), and the Order table has a PK of (TenantID, OrderID) and an FK of (TenantID, OutletID) which references the PK on the Outlet table. Properly defined FKs will prevent Tenant data from getting intermixed.
Does the composite key order matter?

Well, here is where it gets fun. There is some debate as to which field should come first. The "typical" rule for designing good indexes is to pick the most selective field to be the leading field. TenantID, by its very nature, will not be the most selective field; the ID field is the most selective field. Here are some thoughts:
- ID first: This is the most selective (i.e. most unique) field. But by being an auto increment field (or random if still using GUIDs), each customer's data is spread out throughout each table. This means that there are times when a customer needs 100 rows, and that requires almost 100 data pages read from disk (not fast) into the Buffer Pool (taking up more space than 10 data pages). It also increases contention on the data pages since it will be more frequent that multiple customers will need to update the same data page.
  
  However, you typically do not run into as many parameter sniffing / bad cached plan issues as much since the statistics across the different ID values is fairly consistent. You might not get the most optimal plans, but you will be less likely to get horrible ones. This method essentially sacrifices performance (slightly) across all customers to gain the benefit of less frequent problems.
- TenantID first: This is very much not selective at all. There might be very little variation across 1 million rows if you only have 100 TenantIDs. But the statistics for these queries is more accurate since SQL Server will know that a query for Tenant A will pull back 500,000 rows but that same query for Tenant B is only 50 rows. This is where the main pain-point is. This method greatly increases the chances of having parameter sniffing issues where the first run of a Stored Procedure is for Tenant A and acts appropriately based on the Query Optimizer seeing those statistics and knowing it needs to be efficient getting 500k rows. But when Tenant B, with only 50 rows, runs, that execution plan is no longer appropriate, and in fact, is quite inappropriate. AND, since the data is not being inserted in the order of the leading field, this method creates a lot more page splits (hence more fragmentation) in a short period of time than the other approach.
  
  However, for the first TenantID to run a Stored Procedure, the performance should be better than in the other approach since the data (at least after doing index maintenance) will be physically and logically organized such that far fewer data pages are needed to satisfy the queries. This means less physical I/O, fewer logical reads, less contention between Tenants for the same data pages, less wasted space taken up in the Buffer Pool (hence improved Page Life Expectancy) etc.
  
  There are two main costs to getting this improved performance. The first is not so difficult: you must do regular index maintenance to counteract the increased fragmentation. The second is a bit less fun.
  
  In order to counteract the increased parameter sniffing issues, you need to separate the execution plans between Tenants. The simplistic approach is to use WITH RECOMPILE on procs or the OPTION (RECOMPILE) query hint, but that is a hit on performance that could wipe away all of the gains made by putting TenantID first. The method that I found worked best is to use parameterized Dynamic SQL via sp_executesql. The reason for needing the Dynamic SQL is to allow for concatenating the TenantID into the text of the query, while all other predicates that would normally be parameters are still parameters. For example, if you were looking for a particular Order, you would do something like:
```
DECLARE @GetOrderSQL NVARCHAR(MAX);
SET @GetOrderSQL = N'
  SELECT ord.field1, ord.field2, etc.
  FROM   dbo.Orders ord
  WHERE  ord.TenantID = ' + CONVERT(NVARCHAR(10), @TenantID) + N'
  AND    ord.OrderID = @OrderID_dyn;
';

EXEC sp_executesql
   @GetOrderSQL,
   N'@OrderID_dyn INT',
   @OrderID_dyn = @OrderID;
```
  The effect this has is to create a reusable query plan for just that TenantID that will match the data volume of that particular Tenant. If that same Tenant A executes the stored procedure again for another @OrderID then it will reuse that cached query plan. A different Tenant running that same Stored Procedure would generate a query text that was different only in the value of the TenantID, but any difference in the query text is enough to generate a different plan. And the plan generated for Tenant B will not only match the data volume for Tenant B, but it will also be reusable for Tenant B for different values of @OrderID (since that predicate is still parameterized).
  
  The downsides to this approach are:
  - It is a little more work than just typing in a simple query (but not all queries need to be Dynamic SQL, just the ones that end up having the parameter sniffing problem).
  - Depending on how many Tenants are on a system, it does increase the size of the plan cache since each query now requires 1 plan per TenantID that is calling it. This might not be an issue, but is at least something to be aware of.
  - Dynamic SQL breaks the ownership chain, which means read/write access to tables cannot be assumed by having EXECUTE permission on the Stored Procedure. The easy but less secure fix is just to give the User direct access to the tables. This is certainly not ideal, but that is usually the trade-off for quick and easy. The more secure approach is to use Certificate-based security. Meaning, create a Certificate, then create a User from that Certificate, grant that User the desired permissions (a Certificate-based User or Login cannot connect to SQL Server on its own), and then sign the Stored Procedures that use Dynamic SQL with that same Certificate via ADD SIGNATURE.
    
    For more information on module signing and Certificates, please see: ModuleSigning.Info
Please see UPDATE section towards the end for additional topics related to the issue of dealing with mitigating statistics issues resulting from this decision.

^** Personally, I really dislike using just "ID" for the PK field name on every table as it is not meaningful, and it is not consistent across FKs since the PK is always "ID" and the field in the child table has to include the parent table name. For example: Orders.ID -> OrderItems.OrderID. I find it much easier to deal with a data model that has: Orders.OrderID -> OrderItems.OrderID. It is more readable, and cuts down on the number of times you will get the "ambiguous column reference" error :-).

UPDATE

Would the OPTIMIZE FOR UNKNOWN Query Hint (introduced in SQL Server 2008) help with either ordering of the composite PK?

Not really. This option does get around parameter sniffing issues, but it merely replaces one problem with another. In this case, rather than remembering the statistical info for the parameter values of the initial run of the stored procedure or parameterized query (which is definitely great for some, but potentially mediocre for some, and potentially horrible for some), it uses a general statistic of data distribution to estimate row counts. This is hit-or-miss as to how many (and to what degree) queries will be affected positively, negatively, or not at all. At least with parameter sniffing some queries were guaranteed to benefit. If your system has Tenants with widely varied data volumes, this could potentially hurt performance for all queries.

This option accomplishes the same thing as copying input parameters to local variables and then using the local variables in the query (I have tested this but no room for that here). Additional info can be found in this blog post: http://www.brentozar.com/archive/2013/06/optimize-for-unknown-sql-server-parameter-sniffing/. Reading the comments, Daniel Pepermans came to a conclusion similar to mine regarding the use of Dynamic SQL that has limited variation.
If ID is the leading field in the Clustered Index, would it help / suffice to have a Non-Clustered Index on (TenantID, ID), or just (TenantID) to have accurate statistics for queries that process many rows of a single tenant?

Yes, it would help. The large system that I mentioned working on for years was based on an index design of having the IDENTITY field as the leading field because it was more selective and reduced parameter sniffing issues. However, when we needed to operations against a good portion of a particular Tenant's data, the performance did not hold up. In fact, a project to migrate all data into new databases had to be put on hold because the SAN controllers got maxed out in terms of throughput. The fix was to add Non-Clustered Indexes to all tenant data tables to be just (TenantID). No need to do (TenantID, ID) since ID is already in the Clustered Index so the internal structure of the Non-Clustered Index was naturally (TenantID, ID).

While this did solve the immediate issue of being able to do TenantID-based queries much more efficiently, they still weren't as efficient as they could have been if it were the Clustered Index that were in that same order. And, now we had yet one more index on every table. That increased the amount of SAN space we were using, increased the size of our backups, made the backups take longer to complete, increased the potential for blocking and deadlocks, decreased performance on INSERT and DELETE operations, etc.

AND we were still left with the general inefficiency of having a Tenant's data spread out across many data pages, intermixed with many other Tenant's data. As I mentioned above, this increases the amount of contention on these pages, and it fills up the Buffer Pool with lots of data pages that have 1 or 2 useful rows in them, especially when some of the rows on those pages were for clients that were inactive but hadn't been garbage collected yet. There is much less potential for re-use of the data pages in the Buffer Pool in this approach, so our Page Life Expectancy was pretty low. And that means more time going back to disk to load more pages.

Best Answer

Related Solutions

Sql-server – Is this schema usable for multi-tenant clients?How to remove multi-cascade paths

Sql-server – Composite Primary Key in multi-tenant SQL Server database

Related Question