SQL Server Performance Optimization – Is Multiple Databases Sharding for Read on Docker a Good Idea?

clustering, optimization, performance, query-performance, sharding, sql-server

So I have a server with multiple CPU cores and a database installed on it. Do you think it is a good idea to install multiple databases (data sharding) with Docker, so that each request goes to a different database? The data is different in each database, and based on the request from a client, the query runs against the appropriate database.

I don't know whether this would be better or worse than a single dedicated database holding all the data.

Best Answer

The gotcha with sharding is that the application has to know which shard to query. Generally, this is done by sharding on something like the client. I'll adapt one of my old blog posts to use as my answer.

When you’re building an application for lots of clients, there are two common ways to design the database(s):

  • Option A: Put all clients in the same database
  • Option B: Build one database per client

Putting All the Clients in the Same Database

It’s simple: just add a Client table at the top of the schema, add a ClientUsers table to make sure people only see their own data, and away we go.
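The shared-database layout above can be sketched as a minimal schema. This is a hedged illustration only: it uses SQLite via Python for portability, and the table and column names (`Client`, `ClientUsers`, `Orders`, `ClientID`) are assumptions, not taken from any real application.

```python
import sqlite3

# In-memory SQLite stands in for the single shared SQL Server database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One Client table at the top of the schema.
    CREATE TABLE Client (
        ClientID   INTEGER PRIMARY KEY,
        ClientName TEXT NOT NULL
    );
    -- Maps application users to the client whose data they may see.
    CREATE TABLE ClientUsers (
        UserID   INTEGER PRIMARY KEY,
        ClientID INTEGER NOT NULL REFERENCES Client (ClientID)
    );
    -- Every tenant-owned table carries a ClientID column.
    CREATE TABLE Orders (
        OrderID  INTEGER PRIMARY KEY,
        ClientID INTEGER NOT NULL REFERENCES Client (ClientID),
        Amount   REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO Client VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?)",
                 [(10, 1, 99.0), (11, 2, 45.0), (12, 1, 12.5)])

def orders_for_client(client_id):
    """Every query filters on ClientID, so tenants only see their own rows."""
    return conn.execute(
        "SELECT OrderID, Amount FROM Orders WHERE ClientID = ?",
        (client_id,),
    ).fetchall()

print(orders_for_client(1))  # only client 1's orders
```

The key point is that tenant isolation lives in the `WHERE ClientID = ?` predicate on every query, which is exactly why a forgotten filter is the classic bug in this design.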

Benefits of this approach:

Easier schema management. When developers deploy a new version of the application, they only have to make schema changes in one database. There are no worries about different customers being out of sync or on the wrong version.

Easier performance tuning. We can check index usage and statistics in just one place, implement improvements easily, and see the effects immediately across all our clients. With hundreds or thousands of databases, even the smallest change can be difficult to coordinate. We can check our procedure cache contents and know for certain which queries or stored procedures are the most intensive across our entire application, whereas if we’re using separate databases per client, we may have a tougher time aggregating query use across different execution plans.

Easier to build an external API. If we need to grant outsiders access to our data so they can build products on it, that's easier when all of the data is in a single database. If the API has to deal with grouping data from multiple databases on multiple servers, that adds development and testing time. (On the other hand, that “multiple servers” phrase hints at a restriction of the one-database-to-rule-them-all scenario: one database usually means all our load lands on just one database server.)

Easier high availability & disaster recovery. It’s really, really simple to manage database mirroring, log shipping, replication, and clustering if all we have to worry about is just one database. We can build a heck of an infrastructure quickly.

Putting Each Client in its Own Database or Shard

You still need a client listing, but now it becomes a directory - for each client, you also track the shard it lives in. On startup, your app queries this table, and caches it in RAM. When it needs data for a client, it connects directly to that shard (database & server).
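The startup lookup described above can be sketched as a small shard directory. This is only an illustrative sketch: the dictionary stands in for the client-listing table the app would query at startup, and all server and database names are made up.

```python
# The client listing, now acting as a directory: for each client, the
# shard (server + database) its data lives in. Values are illustrative.
SHARD_DIRECTORY = {
    101: {"server": "sql-eu-01", "database": "ClientDB_101"},
    102: {"server": "sql-us-01", "database": "ClientDB_102"},
}

# In a real app this would be loaded from the directory table once at
# startup and cached in RAM, as described above.
_cache = dict(SHARD_DIRECTORY)

def shard_for(client_id):
    """Return the shard a client's data lives in, from the in-memory cache."""
    try:
        return _cache[client_id]
    except KeyError:
        raise LookupError(f"no shard registered for client {client_id}")

def connection_string(client_id):
    """Build a connection string pointing directly at the client's shard."""
    shard = shard_for(client_id)
    # A real app would hand this to its SQL Server driver.
    return f"Server={shard['server']};Database={shard['database']}"

print(connection_string(101))
```

Because the directory is cached, routing a request adds only a dictionary lookup; the cache just has to be refreshed whenever a client is moved to a different shard.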

Benefits of this approach:

Easier single-client restores. Clients are unreliable meatbags. (Except mine – they’re reliable meatbags.) They have all kinds of “oops” moments where they want to retrieve all of their data back to a point in time, and that’s a huge pain in the rear if their data is intermingled with other client data in the same tables. Restores in a single-client-database scenario are brain-dead easy: just restore the client’s database. No one else is affected.

Easier data exports. Clients love getting their hands on their data. They want the security of knowing they can get their data out anytime they want, avoiding the dreaded vendor lock-in scenario, and they want to do their own reporting. With each client’s data isolated into their own database, we can simply give them a copy of their own database backup. We don’t have to build data export APIs.

Easier multi-server scalability. When our application needs more power than we can get from a single server, we can divide up the databases between multiple servers. We can also spread out the load geographically, putting servers in Asia or Europe to be closer to clients.

Easier per-client performance tuning. If some clients use different features or reports, we can build a specialized set of indexes or indexed views just for those clients without growing everyone’s data size. Granted, there’s some risk here – by allowing schema differences between clients, we’ve just made our code deployments a little riskier and our performance management more difficult.

Easier security management. As long as we’ve properly locked down security with one user per database, we don’t have to worry about Client X accessing Client Y’s data. However, if we just use a single login for everyone, then we haven’t really addressed this concern.

Easier maintenance windows. When customers are scattered around the globe, it’s easier to take them offline for maintenance if we can do it in groups or zones.

Which one is right for you?

There’s no one right choice: you have to know your own company’s strengths and weaknesses. Let’s take two of my clients as examples.

Company A excels at hardware performance tuning. They’re really, really good at wringing the very last bit of performance out of hardware, and they don’t mind replacing their SQL Server hardware on a 12-18 month cycle. (They refresh web servers every 4-6 months!) Their Achilles’ heel is extreme compliance and security requirements. They have incredible auditing needs, and it’s just easier for them to implement bulletproof controls on a single server, single database than it is to manage those requirements across thousands of databases on dozens of servers. They chose one database, one server, many clients.

Company B excels at development practices. Managing schema changes and code deployments across thousands of databases just isn’t an issue for them. They have clients around the world, and they’re processing credit card transactions for those clients around the clock. They need the ability to spread load geographically, and they don’t want to replace servers around the world every 12-18 months. They chose one database for each client, and it’s paying off as they start to put SQL Servers in Asia and Europe for their offshore clients.