SQL Server – Splitting a large table to improve performance

best practices, performance, sql server

This is a follow-up to an earlier question. I have a SQL Server 2008 R2 Standard server that holds a single database, which itself contains almost nothing except one large table.

The table is 100+ million rows (35 columns) and growing at around 250,000 rows per day. We need all the data to be "online", and most of the columns need to be searchable in some fashion. The vast majority of activity on the table is reading; apart from the new data being INSERTed during the day, there's no need to change anything.

Users run a variety of queries against the table, from simple look-up-a-record requests to pulling tens of thousands of rows based on a range of criteria. We have only limited control over the queries that are run, and performance is starting to suffer, even with indexing.

A big part of the problem is disk I/O, which we're addressing by retrofitting an SSD-based array. As all database files will be on this new array, the consensus is that having multiple database files won't make any difference, but that splitting the table up into separate tables might be the way to go.

I'm now puzzling over what would be the best approach to this. Two ideas that I'm debating with myself:

  1. Split the table into "tiers"

    • A table containing the last week's data, which is the one being
      INSERTed into each day
    • Next table containing data from one week back to 3 months old
    • Next table containing data from 3 to 6 months old
    • Next table containing anything older than 6 months

    I'd then "shuffle" the data down the tiers overnight (the database
    is only accessed 8am-10pm, so I have a window overnight to process
    data).

  2. Create tables for date ranges

    • Create a table per date range – say per quarter. I'd then have
      data INSERTing into a 2Q2013 table, then roll over to 3Q2013,
      4Q2013, etc.

    I could use filegroups to make older tables "read only" if this
    would improve performance.
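
To make option 2 a bit more concrete, this is roughly what I have in mind for the filegroup side (the database, filegroup, file path, and table names below are just placeholders):

```sql
-- One filegroup and one table per quarter; closed quarters get flipped to READ_ONLY.
ALTER DATABASE YourDb ADD FILEGROUP FG_2013Q2;
ALTER DATABASE YourDb
    ADD FILE (NAME = 'Data_2013Q2', FILENAME = 'S:\Data\Data_2013Q2.ndf')
    TO FILEGROUP FG_2013Q2;

CREATE TABLE dbo.Activity_2013Q2
(
    ActivityID bigint    NOT NULL PRIMARY KEY,
    EventDate  datetime2 NOT NULL
    -- ... the remaining columns ...
) ON FG_2013Q2;

-- Once the quarter is closed (needs exclusive access to the database):
ALTER DATABASE YourDb MODIFY FILEGROUP FG_2013Q2 READ_ONLY;
```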

Option 1 is the easiest for me to implement, but I'm not sure whether it's a completely mad idea. Option 2 is more work to implement and maintain, but if it's "best practice" for this kind of problem then that's the way I'll go.

Any and all advice or alternative ideas would be gratefully received – I'm aware that these kinds of problems are best solved at design time.

Best Answer

I personally would go with your first option. If you use a `DELETE dbo.p1 OUTPUT DELETED.* INTO dbo.p2` pattern to move the data, there is not a lot that can fail. Moving 250K x 3 rows that way in the timeframe you have should not be a problem either, if you do it in batches of around 10K rows.
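
For example, something along these lines; the table names, the EventDate column, and the one-week cutoff are placeholders for whatever your actual schema and tier boundaries are:

```sql
-- Nightly shuffle from the "current" tier into the next tier, in 10K-row batches.
DECLARE @cutoff datetime2 = DATEADD(DAY, -7, SYSDATETIME());

WHILE 1 = 1
BEGIN
    DELETE TOP (10000)
    FROM dbo.Activity_Current
    OUTPUT DELETED.* INTO dbo.Activity_Next   -- both tables need the same column layout
    WHERE EventDate < @cutoff;

    IF @@ROWCOUNT = 0
        BREAK;   -- nothing older than the cutoff left to move for this tier
END;
```

The same pattern repeats for each tier boundary, with its own cutoff.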

The advantage I see over the more common calendar-based partitioning is that your "partitions" stay the same size. With the amount of data you are dealing with, a calendar-based partitioning approach would work very well at the beginning of the month and could be significantly slower at the end of the month.

A while back I wrote an article together with Kalen Delaney for sqlmag about the advantages of the "move data from partition to partition as it ages" approach: http://sqlmag.com/database-administration/using-table-partitions-archive-old-data-oltp-environments

That article uses the Enterprise-only built-in partitioning feature, but you can implement the same idea with hand-rolled multi-table partitioning too.
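
As a rough illustration of the hand-rolled variant (every object name, column, and boundary value below is made up, and with sliding tiers the CHECK constraints would have to be adjusted during the overnight window), a partitioned view lets existing queries keep hitting a single name while the optimizer skips tiers whose CHECK constraint rules them out:

```sql
CREATE TABLE dbo.Activity_Recent
(
    ActivityID bigint    NOT NULL,
    EventDate  datetime2 NOT NULL
        CONSTRAINT CK_Activity_Recent CHECK (EventDate >= '20130401'),
    -- ... the remaining columns ...
    CONSTRAINT PK_Activity_Recent PRIMARY KEY (ActivityID, EventDate)
);

CREATE TABLE dbo.Activity_Archive
(
    ActivityID bigint    NOT NULL,
    EventDate  datetime2 NOT NULL
        CONSTRAINT CK_Activity_Archive CHECK (EventDate < '20130401'),
    -- ... the remaining columns ...
    CONSTRAINT PK_Activity_Archive PRIMARY KEY (ActivityID, EventDate)
);
GO

-- Queries keep using one name; the CHECK constraints let SQL Server skip
-- tables that cannot contain rows matching the date predicate.
CREATE VIEW dbo.Activity
AS
SELECT ActivityID, EventDate FROM dbo.Activity_Recent
UNION ALL
SELECT ActivityID, EventDate FROM dbo.Activity_Archive;
```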

I would try to place the different tables (partitions) on separate drives. Because the data in the older partitions changes only during downtime, you could still mark those filegroups read-only during the day. Alternatively, you can use READ UNCOMMITTED isolation or READ COMMITTED SNAPSHOT isolation. The latter should speed things up on the "current" partition too, assuming you really never update data, but even with updates it might help. Either way, make sure you test the performance in your environment. (In any event, do not use READ UNCOMMITTED against the active table/partition, as your read queries might end up seeing half-written rows, depending on the data types you are using.)
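
If you go the READ COMMITTED SNAPSHOT route, it is a single database-level switch (the database name is a placeholder; it needs a brief moment of exclusive access, so your overnight window is the place to run it):

```sql
-- Readers no longer block behind the daytime INSERTs once this is on.
-- WITH ROLLBACK IMMEDIATE kicks out other sessions to get the required exclusive access.
ALTER DATABASE YourDb
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;
```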