I am working on a new project and I am in the process of designing the Microsoft SQL Server database behind the system.
The system will serve an unbounded number of customers, and each customer has multiple users. Each customer will have their own subset of data, and the volume of data will be vast: the worst-case scenario is roughly 5 million or more rows per month, across multiple tables, per customer.
To optimize performance per customer, I am considering a design with one system database and multiple customer databases. My main reasoning is that I see no reason why one customer's queries should ever have to scan other customers' data. The issue, as I see it, is that this design would require relationships between the customer databases and the system database (cross-database relationships).
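To make the concern concrete (all database and table names here are made up for illustration): SQL Server rejects a foreign key that points at a table in another database, so a customer database cannot directly enforce referential integrity against the system database.

```sql
-- Hypothetical example: a table in a customer database attempting to
-- reference a lookup table in a shared system database.
USE CustomerDB_001;
GO
ALTER TABLE dbo.Orders
    ADD CONSTRAINT FK_Orders_StatusCode
    FOREIGN KEY (StatusCodeID)
    REFERENCES SystemDB.dbo.StatusCode (StatusCodeID);
-- Fails: "Cross-database foreign key references are not supported" (Msg 1763).
-- Integrity across databases would have to be enforced by triggers or
-- application logic instead.
```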
Am I overthinking the performance issue?
Best Answer
Given that you haven't started coding the app or DB yet, but do have a sense of the scope of the data (even if that will change somewhat over time): no, you are not over-thinking this. I wouldn't spend months pondering this one aspect of the system design, but ignoring it entirely would be a bit foolish, since refactoring app code is far easier than refactoring databases (because databases carry "state").
My preference for multi-tenant systems, especially if dealing with large amounts of data and/or great variation of data sizes between clients, is to go the Database-per-Customer route. There are obviously advantages and disadvantages to any approach, but here is what I see from having gone through this a few times and trying it both ways:
ADVANTAGES

- No need for a CustomerID field in every table (except lookup tables); in a single shared database, that extra key column complicates indexing and performance tuning.

DISADVANTAGES
Regarding the thought of having a single SystemData DB and the related concern of not being able to FK across DBs: how much system lookup data do you expect to have? Assuming it is not on the order of 1 GB, you can create a schema in each customer DB called "System" or "Common" or "Lookups" or whatever, and load the lookup data into that schema in every customer DB. Then you can FK to those tables. And by having one DB per customer, you will already need a release process that updates the code and schema of each customer DB in exactly the same manner (so that they always stay in sync with a common structure), so updating lookup tables fits right into that process. The main downside is that you are duplicating this data in each customer DB, so 200 times, perhaps? But that is over the course of years. You aren't starting out with 200 customers, right?
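As a sketch of that approach (all object names here are hypothetical), each customer DB gets its own copy of the lookup schema, and local tables can then FK to it normally:

```sql
-- Run in each customer database as part of the common release process.
-- All names below are made up for illustration.
CREATE SCHEMA Lookups;
GO
CREATE TABLE Lookups.StatusCode
(
    StatusCodeID  INT          NOT NULL
        CONSTRAINT PK_StatusCode PRIMARY KEY,
    StatusName    VARCHAR(50)  NOT NULL
);
GO
-- Local tables can now enforce referential integrity normally,
-- since the lookup data lives in the same database.
CREATE TABLE dbo.Orders
(
    OrderID       INT NOT NULL IDENTITY(1, 1)
        CONSTRAINT PK_Orders PRIMARY KEY,
    StatusCodeID  INT NOT NULL
        CONSTRAINT FK_Orders_StatusCode
        REFERENCES Lookups.StatusCode (StatusCodeID)
);
```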
Also, please keep in mind that Table Partitioning is not "sharding" (in the sense you might be thinking of from MongoDB or PostgreSQL/Greenplum, etc.). It is not meant to magically solve the problem of multi-tenancy or distributed data. It is meant to handle bulk data loads and unloads quickly and efficiently, and it does quite well at that intended purpose. Occasionally it even helps queries on very large tables (1 billion rows or more). But indexes and statistics are still per-table, not per-partition (though you can create filtered indexes/statistics per partition key to help make up for this). And while you can (in more recent versions) rebuild indexes per partition, starting with SQL Server 2012, partitioned indexes no longer get a "free" UPDATE STATISTICS WITH FULLSCAN when doing a REBUILD! Non-partitioned indexes still do :-).

For what it's worth, I worked on a multi-tenant system that did attempt to use table partitioning in this manner (not my idea). It caused a lot of headaches, frustration, and extra work trying to fix weird execution plans, and after a few years it was mostly agreed that it probably shouldn't have been done.
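To make that last point concrete (the table and index names here are hypothetical): a per-partition rebuild needs an explicit statistics update afterwards, and filtered statistics scoped to one partition's key range can partially compensate for statistics being per-table rather than per-partition.

```sql
-- Hypothetical table dbo.CustomerEvents, partitioned on EventDate.
-- Rebuild just one partition:
ALTER INDEX IX_CustomerEvents_EventDate
    ON dbo.CustomerEvents
    REBUILD PARTITION = 3;

-- Since SQL Server 2012, rebuilding a partitioned index does NOT
-- refresh its statistics, so do that explicitly:
UPDATE STATISTICS dbo.CustomerEvents (IX_CustomerEvents_EventDate)
    WITH FULLSCAN;

-- A filtered statistics object covering one partition's key range
-- approximates per-partition statistics:
CREATE STATISTICS st_CustomerEvents_2023Q1
    ON dbo.CustomerEvents (EventDate)
    WHERE EventDate >= '2023-01-01' AND EventDate < '2023-04-01'
    WITH FULLSCAN;
```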