SQL Server – Splitting a large table to improve performance

best practices, performance, sql server

This is a follow-up to an earlier question. I have a SQL Server 2008 R2 Standard server that holds a single database, which itself contains almost nothing except one large table.

The table is 100+ million rows (35 columns) and growing at around 250,000 rows per day. We need all the data to be "online", and most of the columns need to be searchable in some fashion. The vast majority of activity on the table is reading; apart from the new data being INSERTed during the day, there's no need to change anything.

Users run a variety of queries against the table, from simple look-up-a-record requests to pulling tens of thousands of rows based on a range of criteria. We have only limited control over the queries that are run, and performance is starting to suffer, even with indexing.

A big part of the problem is disk I/O, which we're addressing by retrofitting an SSD-based array. As all database files will be on this new array, the consensus is that having multiple database files won't make any difference, but that splitting the table up into separate tables might be the way to go.

I'm now puzzling over what would be the best approach to this. Two ideas that I'm debating with myself:

  1. Split the table into "tiers"

    • A table containing the last week's data, which is the one being
      INSERTed into each day
    • Next table containing data from one week back to 3 months old
    • Next table containing data from 3 to 6 months old
    • Next table containing anything older than 6 months

    I'd then "shuffle" the data down the tiers overnight (the database
    is only accessed 8am-10pm, so I have a window overnight to process
    data).

  2. Create tables for date ranges

    • Create a table per date range – say per quarter. I'd then have
      data INSERTing into a 2Q2013 table, then roll over to 3Q2013,
      4Q2013, etc.

    I could use filegroups to make older tables "read only" if this
    would improve performance.
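
To make option 2 a bit more concrete, this is roughly what I have in mind for the filegroup side (the database, filegroup, file path, and table names below are just placeholders):

```sql
-- One filegroup and one table per quarter; closed quarters get flipped to READ_ONLY.
ALTER DATABASE YourDb ADD FILEGROUP FG_2013Q2;
ALTER DATABASE YourDb
    ADD FILE (NAME = 'Data_2013Q2', FILENAME = 'S:\Data\Data_2013Q2.ndf')
    TO FILEGROUP FG_2013Q2;

CREATE TABLE dbo.Activity_2013Q2
(
    ActivityID bigint    NOT NULL PRIMARY KEY,
    EventDate  datetime2 NOT NULL
    -- ... the remaining columns ...
) ON FG_2013Q2;

-- Once the quarter is closed (needs exclusive access to the database):
ALTER DATABASE YourDb MODIFY FILEGROUP FG_2013Q2 READ_ONLY;
```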

Option 1 is the easiest for me to implement, but I'm not sure whether it's a completely mad idea. Option 2 is more work to implement and maintain, but if it's "best practice" for this kind of problem then that's the way I'll go.

Any and all advice or alternative ideas would be gratefully received – I'm aware that these kinds of problems are best solved at design time.

Best Answer

I personally would go with your first option. If you use a `DELETE dbo.p1 OUTPUT DELETED.* INTO dbo.p2` pattern to move the data, there is not a lot that can fail. Moving 250K x 3 rows that way in the timeframe you have should not be a problem either, if you do it in batches of around 10K rows.
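
For example, something along these lines; the table names, the EventDate column, and the one-week cutoff are placeholders for whatever your actual schema and tier boundaries are:

```sql
-- Nightly shuffle from the "current" tier into the next tier, in 10K-row batches.
DECLARE @cutoff datetime2 = DATEADD(DAY, -7, SYSDATETIME());

WHILE 1 = 1
BEGIN
    DELETE TOP (10000)
    FROM dbo.Activity_Current
    OUTPUT DELETED.* INTO dbo.Activity_Next   -- both tables need the same column layout
    WHERE EventDate < @cutoff;

    IF @@ROWCOUNT = 0
        BREAK;   -- nothing older than the cutoff left to move for this tier
END;
```

The same pattern repeats for each tier boundary, with its own cutoff.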

The advantage I see over the more common calendar-based partitioning is that your "partitions" stay the same size. With the amount of data you are dealing with, a calendar-based partitioning approach would work very well at the beginning of the month and could be significantly slower at the end of the month.

A while back I wrote an article together with Kalen Delaney for sqlmag about the advantages of the "move data from partition to partition as it ages" approach: http://sqlmag.com/database-administration/using-table-partitions-archive-old-data-oltp-environments

That article uses the Enterprise-only built-in partitioning feature, but you can implement the same idea with hand-rolled multi-table partitioning too.
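
As a rough illustration of the hand-rolled variant (every object name, column, and boundary value below is made up, and with sliding tiers the CHECK constraints would have to be adjusted during the overnight window), a partitioned view lets existing queries keep hitting a single name while the optimizer skips tiers whose CHECK constraint rules them out:

```sql
CREATE TABLE dbo.Activity_Recent
(
    ActivityID bigint    NOT NULL,
    EventDate  datetime2 NOT NULL
        CONSTRAINT CK_Activity_Recent CHECK (EventDate >= '20130401'),
    -- ... the remaining columns ...
    CONSTRAINT PK_Activity_Recent PRIMARY KEY (ActivityID, EventDate)
);

CREATE TABLE dbo.Activity_Archive
(
    ActivityID bigint    NOT NULL,
    EventDate  datetime2 NOT NULL
        CONSTRAINT CK_Activity_Archive CHECK (EventDate < '20130401'),
    -- ... the remaining columns ...
    CONSTRAINT PK_Activity_Archive PRIMARY KEY (ActivityID, EventDate)
);
GO

-- Queries keep using one name; the CHECK constraints let SQL Server skip
-- tables that cannot contain rows matching the date predicate.
CREATE VIEW dbo.Activity
AS
SELECT ActivityID, EventDate FROM dbo.Activity_Recent
UNION ALL
SELECT ActivityID, EventDate FROM dbo.Activity_Archive;
```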

I would try to place the different tables (partitions) on separate drives. Because the data in the older partitions changes only during downtime, you could still mark those filegroups read-only during the day. Alternatively, you can use READ UNCOMMITTED isolation or READ COMMITTED SNAPSHOT isolation. The latter should speed things up on the "current" partition too, assuming you really never update data, but even with updates it might help. Either way, make sure you test the performance in your environment. (In any event, do not use READ UNCOMMITTED against the active table/partition, as your read queries might end up seeing half-written rows, depending on the data types you are using.)
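
If you go the READ COMMITTED SNAPSHOT route, it is a single database-level switch (the database name is a placeholder; it needs a brief moment of exclusive access, so your overnight window is the place to run it):

```sql
-- Readers no longer block behind the daytime INSERTs once this is on.
-- WITH ROLLBACK IMMEDIATE kicks out other sessions to get the required exclusive access.
ALTER DATABASE YourDb
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;
```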