Database Design – Using One Database for 50,000+ Shops

database-design · sql-server

I know Shopify uses only one database for all shops.
But how can they handle their database with such a large amount of data?
Is it a good idea to use a single database for 50,000+ shops?

Best Answer

Please note: I'm answering from a SQL Server perspective, so I mention some concepts specific to SQL Server, but I believe all of these concepts have equivalents in other major RDBMS platforms, with similar benefits and limitations.

I will also probably continue editing this answer as I think of other potential pros/cons.

Well, it really depends on schema, volume, etc. What exactly is a shop storing? How is it different from storing data about 50,000 cats or 50,000 products or 50,000 wingnuts?

There are several reasons (other than just the size aspect on its own) why you might not want to store data for 50,000 different customers in a single database, if indeed the data can be completely segregated by customer (not including lookup tables like zipcodes or application-specific tables, which could go into a single, central database):

  • if one customer outgrows the application, there is no easy way to extract just their data and move it to another instance, server, etc. to scale out, unless you plan ahead and partition on something like CustomerID and have 50,000 filegroups. You're limited to 15,000 partitions anyway (or 1,000 on older versions of SQL Server), and having too many filegroups can be disastrous. Also note that partitioning requires Enterprise Edition.

  • if it turns out that all of your customers are simply too big for this instance, scaling out means getting new hardware and moving the entire database there (and potentially doing that again down the road).

  • deleting a customer can be equally painful, as you will have to delete some percentage of rows from very big tables, and that will not be cheap.

  • you will likely have a wide distribution of customer data sizes (one customer with a billion rows, another with 5,000). This can lead to parameter sniffing and poor cardinality estimates and plan quality, since the same cached plans will be re-used for the same queries against very different data distributions.

  • all of your customers are subject to the exact same SLAs and HA/DR plans. You either have the entire database in full recovery mode with n-minute log backups, or you run in simple recovery and rely on full+diff backups. If you have to revert because of a customer error, or need to recover the database to a point in time, that affects every single customer.

  • there is potential for errors in data retrieval - a bug in a WHERE clause, for example, could lead to one customer seeing another customer's data, or all of the other customers' data.

  • there may be legal implications (some companies will have strict requirements in place that you do not place their data in the same database as any other company, and particularly that of their competitors).

  • if security of any one customer's data is important, then achieving that is much easier using database separation than separation within a table.


Some advantages to having each customer in a separate database (or at least having multiple databases, each for a group of customers):

  • in terms of size, the data will take about the same space on disk either way.
  • scaling out is easier, since you can just move a database (or many) to a different server.
  • deleting a customer and all of its data roughly equates to DROP DATABASE.
  • you're using more memory for plans overall (or you have fewer plans in cache per customer), but at least those plans are relevant to the data in their respective databases and are less prone to statistics and parameter sniffing issues.
  • you can easily have different SLAs and DR plans, placing some databases in full and others in simple. Also reverting or restoring to a point in time only affects that customer.
  • you can easily place different databases (say, your high priority customers) on faster I/O. You could do this in a single database with filegroups, but that is much trickier to manage (at least IMHO).
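In the database-per-customer design, the application needs a way to map a CustomerID to its database. A small Python sketch of one option - a naming convention plus a connection-string builder. The `Customer_<id>` convention and the connection-string format are assumptions for illustration; a real system might instead look the mapping up in a central directory database.

```python
# Illustrative per-tenant routing for a database-per-customer layout.
# Both the naming convention and the server name are assumptions.

def database_for(customer_id: int) -> str:
    """Map a customer to its database under an assumed naming convention."""
    return f"Customer_{customer_id:05d}"

def connection_string(customer_id: int, server: str = "sql01") -> str:
    """Build a SQL Server-style connection string for that customer's database."""
    return (f"Server={server};Database={database_for(customer_id)};"
            "Trusted_Connection=yes;")

conn_str = connection_string(42)
```

This also makes the scale-out point above concrete: moving a customer to another server only changes the `server` value their ID resolves to, not the application code.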

Some drawbacks:

  • size aside, you probably won't want to have 50,000 databases on a single instance of SQL Server, so this will probably mean scaling out to multiple servers.
  • startup time goes up because there is some inherent overhead in starting up each database.
  • the app has to be a bit smarter - instead of just putting CustomerID in the WHERE clause, it has to dynamically connect to CustomerID's database. This isn't hard with a proper middle tier, but it is a change.
  • yes, you have many copies of the same tables and procedures, but code and schema are identical across databases, just the data is different. So deploying code/schema changes is now just a loop instead of a single execution.
  • maintenance is a little different when you're managing 50,000 databases. Again, the overall size is roughly the same, but the process has to change - you can't just defrag / reindex / back up all 50,000 databases at once. That said, at my previous job I managed instances with 500-1,000 identical databases, and the difference between managing 3 identical databases and 750 identical databases is simply the time it takes.
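The "deployment is now just a loop" point can be sketched in a few lines of Python. The executor is injected so the loop can be exercised without a live SQL Server; in practice it might wrap pyodbc or shell out to sqlcmd. The database names and the script below are purely illustrative.

```python
# Sketch: apply one schema/maintenance script to every tenant database in
# turn, collecting failures so they can be retried rather than aborting the
# whole rollout. The execute callable is an assumption standing in for a
# real database driver.

def deploy_to_all(databases, script: str, execute) -> list[str]:
    """Run script against each database; return the names that failed."""
    failed = []
    for db in databases:
        try:
            execute(db, script)      # e.g. run the script in that database
        except Exception:
            failed.append(db)        # keep going; retry failures later
    return failed

dbs = [f"Customer_{i:05d}" for i in range(1, 4)]
log = []
failures = deploy_to_all(dbs, "ALTER TABLE Orders ADD Note nvarchar(200);",
                         lambda db, s: log.append(db))
```

The same loop shape covers reindexing, backups, and consistency checks - which is why, as noted above, managing 750 identical databases differs from managing 3 mostly in elapsed time.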