SQL Server – Joining Across Partitions

sql server

SQL Server documentation for partitioning states:

Where possible, minimize requirements for referential integrity across vertical and functional partitions.

If cross-partition joins are necessary, run parallel queries over the
partitions and join the data within the application.

https://docs.microsoft.com/en-us/azure/architecture/best-practices/data-partitioning

I do understand that lack of referential integrity will make writes much faster since there won’t be a need to check other partitions for referential constraints. But how will handling joins in the application be beneficial.The number of requests to the database server and the processing needed will be the same if it was done by an RDBMS join, wouldn't it? (And in addition, for application-side joins, there will be the burden of network round-trips between the application server and the database server.)

Best Answer

Be very careful not to conflate table partitioning with (and?) database sharding. This specific article doesn't do itself any favors by having such a generic and overloaded title, and you're not the first person to start referring to sharding as partitioning.

In fact, there is already a pull request to change the title to Horizontal, vertical, and functional data partitioning.

The minimize requirements for referential integrity bit implies that you shouldn't be sharding data that you commonly join. If your question is "when would handling joins in the application be beneficial?" it would be a narrow use case, probably where latency between the application and a shard is better than the latency between shards (think distributed app and geo-replicated data), and when all of the data is being cached closer to the application anyway (e.g. Redis), as most good applications will plan for at the outset these days. Make sure you have done proper research to determine that (a) you need sharding at all, and (b) that you should be separating data you plan to join a lot. There are almost certainly going to be separation tactics that minimize the need to join across shards or in the application.

Sharding works best when each node that you're writing to contains data that is largely independent and can live without any of the other nodes. When they depend on each other and you're joining between them, you've largely defeated the purpose of separating them in the first place. A good use case is when each customer/tenant has its own shard (or its own database on one of the shards). Most activity for any given tenant is focused on that shard. When central data is needed, it is often replicated to each of the shards, rather than joined across shards. This is the one of the models APS and Azure SQL Data Warehouse use.

would like to get to the bottom of why MS encourages application-side join instead of a db join

They are talking about aggregate queries there, and I think they are implying that you are concatenating the results, not joining them. For example if customer A is on shard 1 and customer B is on shard 2, and you want to show total sales by customer, the app can (even asynchronously) run an aggregate query on shard 1 for customer A, and an aggregate query on shard 2 for customer B, do that in parallel, and then just union those results together. No joining across shards. I think you are reading way too much into it by saying they are encouraging it or suggesting it is necessary.

And if you want an answer to why Microsoft says something in a very specific way, you're going to have to ask Microsoft. We can help you interpret what they probably mean, and even suggest scenarios where it makes sense, but we can't "get to the bottom" of anything.