SQL Server Performance – When to Split or Partition a Very Large Table

performance, sql-server

Our site has some large but simple (INT, INT, DATE) tables for stats. Each table has up to 300,000,000 rows, and gets bigger every day.

The hosting provider has suggested that we split or partition the tables, and I have seen this recommendation elsewhere on numerous occasions.

However…

I am struggling to reconcile this advice with the stated max capacity for SQL Server – a database size of 524,272 terabytes, with table rows limited only by "available storage".

Based upon those figures, the table described above could easily have centillions of rows (10 to the power of 303).

Ah ha, you might say: there is a difference between CAPABILITY and PERFORMANCE.

But in virtually every question about SQL Server performance, the answer is "It depends… on table design and query design".

That is why I am asking this question. The table design couldn't be much simpler. Nor could the queries, which are simple count(*) operations based on an indexed ID field.
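Roughly, the setup looks something like this (illustrative names only; the real columns are just the INT, INT, DATE described above):

```sql
-- Hypothetical version of one of the stats tables (names invented for illustration).
CREATE TABLE dbo.WidgetStats
(
    WidgetId    INT  NOT NULL,   -- the indexed ID the counts are based on
    EventTypeId INT  NOT NULL,
    EventDate   DATE NOT NULL
);

CREATE NONCLUSTERED INDEX IX_WidgetStats_WidgetId
    ON dbo.WidgetStats (WidgetId);

-- The kind of query in question: a simple count against the indexed ID.
SELECT COUNT(*)
FROM dbo.WidgetStats
WHERE WidgetId = 42;
```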

Best Answer

There's a reason that the general advice is that it depends on the table design and the queries on it. My answer to your other post on Stack Exchange says as much. Saying "queries which are simple count(*) operations based on an indexed ID field" doesn't give much information since it says nothing of the cardinality of the set of rows under consideration. Things you can do to mitigate the (as of now perceived) problems are:

  1. Partitioning. Specifically, your data seems to be logging-type data. My guess is that you want to get stats by some unit of time (e.g. "widgets per day" or "whozits by hour"). Partition by your quantum (i.e. days or hours in the previous examples) and move partitions off to read-only filegroups occasionally (see the partitioning sketch after this list).

  2. On a related note, if the data is write-once, consider pre-aggregating it once the time period is no longer active. That is, why do I need to keep counting how many events happened on a day from three years ago if that data is never going to change? Once the day is over, count everything in that day, store it somewhere else, and never count it again. In fact, if you never need the detailed data (i.e. you only ever do aggregations against it), consider deleting it after you count it. If you implement this idea, you can get even more clever with filtered indexes that cover only the "active" period, which will make your queries faster because they will not cover the vast majority of your data (see the pre-aggregation and filtered-index sketch below).
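To make item 1 concrete, here is a minimal sketch of date-based partitioning, assuming the hypothetical WidgetStats table from the question and monthly boundaries (your quantum may be days or hours instead); the function, scheme, and boundary values are invented for illustration:

```sql
-- Minimal sketch: partition the stats data by date, one partition per month.
CREATE PARTITION FUNCTION pfStatsByMonth (DATE)
AS RANGE RIGHT FOR VALUES ('2013-01-01', '2013-02-01', '2013-03-01');

-- Map every partition to PRIMARY here for simplicity; in practice the older
-- partitions would live on filegroups that get marked read-only.
CREATE PARTITION SCHEME psStatsByMonth
AS PARTITION pfStatsByMonth ALL TO ([PRIMARY]);

-- Create (or rebuild) the table on the partition scheme, keyed by the date column.
CREATE TABLE dbo.WidgetStatsPartitioned
(
    WidgetId    INT  NOT NULL,
    EventTypeId INT  NOT NULL,
    EventDate   DATE NOT NULL
) ON psStatsByMonth (EventDate);
```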
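Item 2 can be sketched the same way: a summary table populated once a day has closed, plus a filtered index that covers only the "active" period. Again, all object names and the cutoff date are invented:

```sql
-- Pre-aggregation sketch: once a day is closed, roll it up once and never
-- scan the detail rows for that day again.
CREATE TABLE dbo.WidgetStatsDaily
(
    WidgetId   INT  NOT NULL,
    EventDate  DATE NOT NULL,
    EventCount INT  NOT NULL,
    CONSTRAINT PK_WidgetStatsDaily PRIMARY KEY (WidgetId, EventDate)
);

INSERT INTO dbo.WidgetStatsDaily (WidgetId, EventDate, EventCount)
SELECT WidgetId, EventDate, COUNT(*)
FROM dbo.WidgetStats
WHERE EventDate = '2013-06-01'        -- the day that just closed
GROUP BY WidgetId, EventDate;

-- Filtered index covering only the "active" period, so current-period counts
-- never have to touch the bulk of the historical rows.
CREATE NONCLUSTERED INDEX IX_WidgetStats_Active
    ON dbo.WidgetStats (WidgetId)
    WHERE EventDate >= '2013-06-01';
```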

But, as my advice in the other post suggests, the only way you're going to know for sure is to load the table up with a reasonable amount of data and try it out. All we can do here is say what will probably work in the general case; without the specifics of your hardware, your data, and your queries, all we can do is guess. And you may find that, once you run the test I'm proposing, the answer is "there's nothing to do" because it works just fine as is.
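If it helps, one quick (and entirely hypothetical) way to load a table like the WidgetStats sketch with test data is to cross join a catalog view against itself and insert synthetic rows; adjust the TOP value to whatever volume you want to test:

```sql
-- Quick-and-dirty synthetic load: insert ~1,000,000 made-up rows so the
-- count(*) queries can be tested at a more realistic scale.
INSERT INTO dbo.WidgetStats (WidgetId, EventTypeId, EventDate)
SELECT TOP (1000000)
    ABS(CHECKSUM(NEWID())) % 100000 + 1,                          -- random WidgetId
    ABS(CHECKSUM(NEWID())) % 10 + 1,                              -- random EventTypeId
    DATEADD(DAY, -(ABS(CHECKSUM(NEWID())) % 365), '2013-06-01')   -- random date in the past year
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;
```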