SQL Server – What's the best way to archive all but the current year and partition the table at the same time?

architecture, archive, sql-server, sql-server-2008-r2

Task

Archive off all but a rolling 13-month period of data from a group of large tables. The archived data must be stored in another database.

  • The database is in simple recovery mode
  • The tables range from 50 million rows to several billion, and in some cases take up hundreds of GB each
  • The tables are currently not partitioned
  • Each table has one clustered index on an ever increasing date column
  • Each table additionally has one non-clustered index
  • All data changes to the tables are inserts
  • The goal is to minimize downtime of the primary database.
  • Server is 2008 R2 Enterprise

The "archive" table will have about 1.1 billion rows, the "live" table about 400 million. Obviously the archive table will increase over time, but I expect the live table to increase reasonably quickly too. Say 50% in the next couple of years at least.

I had thought about Azure Stretch Database, but unfortunately we are on 2008 R2 and likely to stay there for a while.

Current Plan

  • Create a new database
  • Create new tables partitioned by month (using the modified date) in the new database.
  • Move the most recent 12-13 months of data into the partitioned tables.
  • Do a rename swap of the two databases
  • Delete the moved data from the now "archive" database.
  • Partition each of the tables in the "archive" database.
  • Use partition swaps to archive the data in the future.
    • I do realize that I'll have to swap out the data to be archived, copy that table to the archive database, and then swap it into the archive table. This is acceptable (see the sketch after this list).
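For context, here is a minimal sketch of that future archive step against the demo schema further down. The staging table names, the archive database name (Audit_Archive), and the partition numbers are illustrative; each staging table must match the relevant partition's schema, indexes, and filegroup, and a SWITCH can never cross databases, which is why the copy step is unavoidable:

-- 1) In the live database: switch the expiring month out (metadata-only).
--    AuditTable_SwitchOut is an empty table with the same schema, indexes,
--    and filegroup as the partition being removed.
ALTER TABLE dbo.AuditTable
    SWITCH PARTITION 2 TO dbo.AuditTable_SwitchOut;

-- 2) Copy the rows into a staging heap in the archive database.
--    TABLOCK into an empty heap is minimally logged under simple recovery.
INSERT INTO Audit_Archive.dbo.AuditTable_Staging WITH (TABLOCK)
SELECT * FROM dbo.AuditTable_SwitchOut;

DROP TABLE dbo.AuditTable_SwitchOut;

USE Audit_Archive;
GO
-- 3) Index and constrain the staging table to match the target partition
--    (same index/check-constraint pattern as in the answer below), then
--    switch it into the archive table.
ALTER TABLE dbo.AuditTable_Staging
    SWITCH TO dbo.AuditTable PARTITION 2;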

Problem: I'm trying to move the data into the initial partitioned tables (in fact, I'm still doing a proof of concept on it). I'm trying to use TF 610 (as per the Data Loading Performance Guide) and an INSERT...SELECT statement to move the data, thinking it would be minimally logged. Unfortunately, every time I try, the load is fully logged.

At this point I'm thinking my best bet may be to move the data using an SSIS package. I'm trying to avoid that since I'm working with 200 tables and anything I can do by script I can easily generate and run.
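Since everything here is script-driven, the per-table move statements can themselves be generated from the catalog views. A sketch of that idea, assuming all 200 tables live in dbo and share the Modified column (review the generated output before running it):

-- Generate one INSERT...SELECT per table; the result set is the script.
SELECT 'INSERT INTO Audit_New.dbo.' + QUOTENAME(t.name)
     + ' SELECT * FROM Audit.dbo.' + QUOTENAME(t.name)
     + ' WHERE Modified >= ''20150601'' ORDER BY Modified;'
FROM Audit.sys.tables AS t
ORDER BY t.name;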

Is there anything I'm missing in my general plan, and is SSIS my best bet for moving the data quickly and with minimal use of the log (space concerns)?

Demo code without data

-- Existing structure
USE [Audit]
GO

CREATE TABLE [dbo].[AuditTable](
    [Col1] [bigint] NULL,
    [Col2] [int] NULL,
    [Col3] [int] NULL,
    [Col4] [int] NULL,
    [Col5] [int] NULL,
    [Col6] [money] NULL,
    [Modified] [datetime] NULL,
    [ModifiedBy] [varchar](50) NULL,
    [ModifiedType] [char](1) NULL
); 
-- ~1.4 billion rows, ~20% in the last year

CREATE CLUSTERED INDEX [AuditTable_Modified] ON [dbo].[AuditTable]
(   [Modified] ASC   )
GO


-- New DB & Code
USE Audit_New
GO

CREATE PARTITION FUNCTION ThirteenMonthPartFunction (datetime)
AS RANGE RIGHT FOR VALUES ('20150701', '20150801', '20150901', '20151001', '20151101', '20151201', 
                            '20160101', '20160201', '20160301', '20160401', '20160501', '20160601', 
                            '20160701') 

CREATE PARTITION SCHEME ThirteenMonthPartScheme AS PARTITION ThirteenMonthPartFunction
ALL TO ( [PRIMARY] );

CREATE TABLE [dbo].[AuditTable](
    [Col1] [bigint] NULL,
    [Col2] [int] NULL,
    [Col3] [int] NULL,
    [Col4] [int] NULL,
    [Col5] [int] NULL,
    [Col6] [money] NULL,
    [Modified] [datetime] NULL,
    [ModifiedBy] [varchar](50) NULL,
    [ModifiedType] [char](1) NULL
) ON ThirteenMonthPartScheme (Modified)
GO

CREATE CLUSTERED INDEX [AuditTable_Modified] ON [dbo].[AuditTable]
(
    [Modified] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON ThirteenMonthPartScheme (Modified)
GO

CREATE NONCLUSTERED INDEX [AuditTable_Col1_Col2_Col3_Col4_Modified] ON [dbo].[AuditTable]
(
    [Col1] ASC,
    [Col2] ASC,
    [Col3] ASC,
    [Col4] ASC,
    [Modified] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON ThirteenMonthPartScheme (Modified)
GO

Move code

USE Audit_New
GO
DBCC TRACEON(610);

INSERT INTO AuditTable
SELECT * FROM Audit.dbo.AuditTable
WHERE Modified >= '6/1/2015'
ORDER BY Modified

Best Answer

Why are you not getting minimal logging?

I have found the Data Loading Performance Guide, which you reference, to be an extremely valuable resource. However, it is also not 100% comprehensive, and I suspect the grid is already complex enough that the author did not add a "Table Partitioning" column to break out differences in behavior depending on whether the table receiving the inserts is partitioned. As we will see later, the fact that the table is already partitioned appears to inhibit minimal logging.
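As an aside, you can verify the logging behavior of a test load yourself by inspecting the transaction log. A sketch using the undocumented sys.fn_dblog function (run it in the target database immediately after the load, before a checkpoint clears the log under simple recovery):

-- Summarize log activity by operation for the target table's allocation
-- units. A fully logged load is dominated by per-row LOP_INSERT_ROWS
-- records; a minimally logged one shows mostly page allocation/format
-- operations.
SELECT [Operation],
       COUNT(*)                 AS LogRecords,
       SUM([Log Record Length]) AS LogBytes
FROM sys.fn_dblog(NULL, NULL)
WHERE [AllocUnitName] LIKE '%AuditTable%'
GROUP BY [Operation]
ORDER BY LogBytes DESC;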


Recommended approach

Based upon the recommendations in the Data Loading Performance Guide (including the "Bulk Loading a Partitioned Table" section) as well as extensive experience loading partitioned tables with tens of billions of rows, here is the approach I would recommend:

  • Create a new database.
  • Create new tables partitioned by month in the new database.
  • Move the most recent year of data, in the following fashion (one month's iteration is sketched in code after this list):
    • For each month, create a new heap table;
    • Insert that month of data into the heap using the TABLOCK hint;
    • Add the clustered index to the heap containing that month of data;
    • Add the check constraint enforcing that the table contains just this month's data;
    • Switch the table into the corresponding partition of the new overall partitioned table.
  • Do a rename swap of the two databases.
  • Truncate the data in the now "archive" database.
  • Partition each of the tables in the "archive" database.
  • Use partition swaps to archive the data in the future.
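Here is a minimal sketch of one iteration of that monthly loop against the demo schema. The staging table name and boundary dates are illustrative, and $PARTITION can compute the target partition number for a given date:

USE Audit_New;
GO

-- 1) Heap staging table: same columns and filegroup as the target.
CREATE TABLE dbo.AuditTable_Load (
    Col1 bigint NULL, Col2 int NULL, Col3 int NULL, Col4 int NULL,
    Col5 int NULL, Col6 money NULL, Modified datetime NULL,
    ModifiedBy varchar(50) NULL, ModifiedType char(1) NULL
) ON [PRIMARY];

-- 2) Minimally logged load of one month (TABLOCK into an empty heap,
--    simple recovery model).
INSERT INTO dbo.AuditTable_Load WITH (TABLOCK)
SELECT Col1, Col2, Col3, Col4, Col5, Col6, Modified, ModifiedBy, ModifiedType
FROM Audit.dbo.AuditTable
WHERE Modified >= '20150701' AND Modified < '20150801';

-- 3) Build the same indexes as the target so the switch is allowed.
CREATE CLUSTERED INDEX AuditTable_Load_Modified
    ON dbo.AuditTable_Load (Modified ASC) ON [PRIMARY];
CREATE NONCLUSTERED INDEX AuditTable_Load_Col1_Col2_Col3_Col4_Modified
    ON dbo.AuditTable_Load (Col1, Col2, Col3, Col4, Modified) ON [PRIMARY];

-- 4) Trusted check constraint proving every row belongs to the target
--    partition. The IS NOT NULL test matters: Modified is nullable, and
--    a CHECK constraint lets NULLs through otherwise.
ALTER TABLE dbo.AuditTable_Load WITH CHECK ADD CONSTRAINT CK_AuditTable_Load
    CHECK (Modified IS NOT NULL
           AND Modified >= '20150701' AND Modified < '20150801');

-- 5) Metadata-only switch into the matching partition
--    ($PARTITION.ThirteenMonthPartFunction('20150701') = 2 here).
ALTER TABLE dbo.AuditTable_Load
    SWITCH TO dbo.AuditTable PARTITION 2;

DROP TABLE dbo.AuditTable_Load;
GO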

The differences when compared to your original approach:

  • The methodology of moving the recent 12-13 months of data will be much more efficient if you load into a heap with TABLOCK one month at a time, using partition switching to place the data into the partitioned table.
  • A DELETE to clear the moved data out of the old table will be fully logged. Perhaps you can either TRUNCATE, or drop the table and create a new archive table (a sketch of the rebuild option follows this list).
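A sketch of the rebuild option, run in the now-archive database (the new table name is illustrative; SELECT...INTO is minimally logged under the simple recovery model, and the boundary matches the data that was moved to the new database):

-- Copy only the rows being kept into a new table, then swap the names.
SELECT *
INTO dbo.AuditTable_Rebuilt
FROM dbo.AuditTable
WHERE Modified < '20150601';   -- everything older than the moved data

EXEC sp_rename 'dbo.AuditTable', 'AuditTable_Old';
EXEC sp_rename 'dbo.AuditTable_Rebuilt', 'AuditTable';
DROP TABLE dbo.AuditTable_Old;

-- SELECT...INTO produces a heap, so rebuild the clustered index afterward
-- (a large but one-time operation on the retained rows).
CREATE CLUSTERED INDEX AuditTable_Modified
    ON dbo.AuditTable (Modified ASC);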

Comparison of approaches for moving the recent year of data

In order to compare the approaches in a reasonable amount of time on my machine, I used a 100 million row test data set that I generated and that follows your schema.

As you can see from the results below, there is a large performance boost and reduction in log writes by loading data into a heap using the TABLOCK hint. There is an additional benefit if this is done one partition at a time. It's also worth noting that the one-partition-at-a-time method can easily be parallelized further if you run multiple partitions at once. Depending on your hardware, that might yield a nice boost; we typically load at least four partitions at once on server-class hardware.

[Screenshot: elapsed time and transaction log writes for each load approach]

Here is the full test script.

Final notes

All of these results depend on your hardware to some degree. However, my tests were conducted on a standard quad-core laptop with a spinning disk drive. The data loads will likely be much faster if you are using a decent server that does not have a lot of other load at the time you run this process.

For example, I ran the recommended approach on an actual dev server (Dell R720) and saw a reduction to 76 seconds (from 156 seconds on my laptop). Interestingly, the original approach of inserting into a partitioned table did not experience the same improvement and still took just over 12 minutes on the dev server. Presumably this is because this pattern yields a serial execution plan, and a single processor on my laptop can match a single processor on the dev server.