SQL Server – What's the best way to archive all but the current year and partition the table at the same time?

architecture, archive, sql-server, sql-server-2008-r2

Task

Archive off all but a rolling 13-month period of data from a group of large tables. The archived data must be stored in another database.

  • The database is in simple recovery mode
  • The tables range from 50 million rows to several billion, and in some cases take up hundreds of GB each
  • The tables are currently not partitioned
  • Each table has one clustered index on an ever increasing date column
  • Each table additionally has one non-clustered index
  • All data changes to the tables are inserts
  • The goal is to minimize downtime of the primary database.
  • Server is 2008 R2 Enterprise

The "archive" table will have about 1.1 billion rows, the "live" table about 400 million. Obviously the archive table will increase over time, but I expect the live table to increase reasonably quickly too. Say 50% in the next couple of years at least.

I had thought about Azure Stretch Database, but unfortunately we are on 2008 R2 and likely to stay there for a while.

Current Plan

  • Create a new database
  • Create new tables partitioned by month (using the modified date) in the new database.
  • Move the most recent 12-13 months of data into the partitioned tables.
  • Do a rename swap of the two databases
  • Delete the moved data from the now "archive" database.
  • Partition each of the tables in the "archive" database.
  • Use partition swaps to archive the data in the future.
    • I do realize that I'll have to swap out the data to be archived, copy that table to the archive database, and then swap it into the archive table. This is acceptable (see the sketch after this list).
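For context, here is a minimal sketch of that future archive step against the demo schema further down. The staging table names, the archive database name (Audit_Archive), and the partition numbers are illustrative; each staging table must match the relevant partition's schema, indexes, and filegroup, and a SWITCH can never cross databases, which is why the copy step is unavoidable:

-- 1) In the live database: switch the expiring month out (metadata-only).
--    AuditTable_SwitchOut is an empty table with the same schema, indexes,
--    and filegroup as the partition being removed.
ALTER TABLE dbo.AuditTable
    SWITCH PARTITION 2 TO dbo.AuditTable_SwitchOut;

-- 2) Copy the rows into a staging heap in the archive database.
--    TABLOCK into an empty heap is minimally logged under simple recovery.
INSERT INTO Audit_Archive.dbo.AuditTable_Staging WITH (TABLOCK)
SELECT * FROM dbo.AuditTable_SwitchOut;

DROP TABLE dbo.AuditTable_SwitchOut;

USE Audit_Archive;
GO
-- 3) Index and constrain the staging table to match the target partition
--    (same index/check-constraint pattern as in the answer below), then
--    switch it into the archive table.
ALTER TABLE dbo.AuditTable_Staging
    SWITCH TO dbo.AuditTable PARTITION 2;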

Problem: I'm trying to move the data into the initial partitioned tables (in fact, I'm still doing a proof of concept on it). I'm trying to use TF 610 (as per the Data Loading Performance Guide) and an INSERT...SELECT statement to move the data, thinking it would be minimally logged. Unfortunately, every time I try, the load is fully logged.

At this point I'm thinking my best bet may be to move the data using an SSIS package. I'm trying to avoid that since I'm working with 200 tables and anything I can do by script I can easily generate and run.
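Since everything here is script-driven, the per-table move statements can themselves be generated from the catalog views. A sketch of that idea, assuming all 200 tables live in dbo and share the Modified column (review the generated output before running it):

-- Generate one INSERT...SELECT per table; the result set is the script.
SELECT 'INSERT INTO Audit_New.dbo.' + QUOTENAME(t.name)
     + ' SELECT * FROM Audit.dbo.' + QUOTENAME(t.name)
     + ' WHERE Modified >= ''20150601'' ORDER BY Modified;'
FROM Audit.sys.tables AS t
ORDER BY t.name;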

Is there anything I'm missing in my general plan, and is SSIS my best bet for moving the data quickly and with minimal use of the log (space concerns)?

Demo code without data

-- Existing structure
USE [Audit]
GO

CREATE TABLE [dbo].[AuditTable](
    [Col1] [bigint] NULL,
    [Col2] [int] NULL,
    [Col3] [int] NULL,
    [Col4] [int] NULL,
    [Col5] [int] NULL,
    [Col6] [money] NULL,
    [Modified] [datetime] NULL,
    [ModifiedBy] [varchar](50) NULL,
    [ModifiedType] [char](1) NULL
); 
-- ~1.4 billion rows, ~20% in the last year

CREATE CLUSTERED INDEX [AuditTable_Modified] ON [dbo].[AuditTable]
(   [Modified] ASC   )
GO


-- New DB & Code
USE Audit_New
GO

CREATE PARTITION FUNCTION ThirteenMonthPartFunction (datetime)
AS RANGE RIGHT FOR VALUES ('20150701', '20150801', '20150901', '20151001', '20151101', '20151201', 
                            '20160101', '20160201', '20160301', '20160401', '20160501', '20160601', 
                            '20160701') 

CREATE PARTITION SCHEME ThirteenMonthPartScheme AS PARTITION ThirteenMonthPartFunction
ALL TO ( [PRIMARY] );

CREATE TABLE [dbo].[AuditTable](
    [Col1] [bigint] NULL,
    [Col2] [int] NULL,
    [Col3] [int] NULL,
    [Col4] [int] NULL,
    [Col5] [int] NULL,
    [Col6] [money] NULL,
    [Modified] [datetime] NULL,
    [ModifiedBy] [varchar](50) NULL,
    [ModifiedType] [char](1) NULL
) ON ThirteenMonthPartScheme (Modified)
GO

CREATE CLUSTERED INDEX [AuditTable_Modified] ON [dbo].[AuditTable]
(
    [Modified] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON ThirteenMonthPartScheme (Modified)
GO

CREATE NONCLUSTERED INDEX [AuditTable_Col1_Col2_Col3_Col4_Modified] ON [dbo].[AuditTable]
(
    [Col1] ASC,
    [Col2] ASC,
    [Col3] ASC,
    [Col4] ASC,
    [Modified] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON ThirteenMonthPartScheme (Modified)
GO

Move code

USE Audit_New
GO
DBCC TRACEON(610);

INSERT INTO AuditTable
SELECT * FROM Audit.dbo.AuditTable
WHERE Modified >= '6/1/2015'
ORDER BY Modified

Best Answer

Why are you not getting minimal logging?

I have found the Data Loading Performance Guide, which you reference, to be an extremely valuable resource. However, it is also not 100% comprehensive, and I suspect the grid is already complex enough that the author did not add a "Table Partitioning" column to break out differences in behavior depending on whether the table receiving the inserts is partitioned. As we will see later, the fact that the table is already partitioned appears to inhibit minimal logging.
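As an aside, you can verify the logging behavior of a test load yourself by inspecting the transaction log. A sketch using the undocumented sys.fn_dblog function (run it in the target database immediately after the load, before a checkpoint clears the log under simple recovery):

-- Summarize log activity by operation for the target table's allocation
-- units. A fully logged load is dominated by per-row LOP_INSERT_ROWS
-- records; a minimally logged one shows mostly page allocation/format
-- operations.
SELECT [Operation],
       COUNT(*)                 AS LogRecords,
       SUM([Log Record Length]) AS LogBytes
FROM sys.fn_dblog(NULL, NULL)
WHERE [AllocUnitName] LIKE '%AuditTable%'
GROUP BY [Operation]
ORDER BY LogBytes DESC;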


Recommended approach

Based upon the recommendations in the Data Loading Performance Guide (including the "Bulk Loading a Partitioned Table" section) as well as extensive experience loading partitioned tables with tens of billions of rows, here is the approach I would recommend:

  • Create a new database.
  • Create new tables partitioned by month in the new database.
  • Move the most recent year of data, in the following fashion (one month's iteration is sketched in code after this list):
    • For each month, create a new heap table;
    • Insert that month of data into the heap using the TABLOCK hint;
    • Add the clustered index to the heap containing that month of data;
    • Add the check constraint enforcing that the table contains just this month's data;
    • Switch the table into the corresponding partition of the new overall partitioned table.
  • Do a rename swap of the two databases.
  • Truncate the data in the now "archive" database.
  • Partition each of the tables in the "archive" database.
  • Use partition swaps to archive the data in the future.
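Here is a minimal sketch of one iteration of that monthly loop against the demo schema. The staging table name and boundary dates are illustrative, and $PARTITION can compute the target partition number for a given date:

USE Audit_New;
GO

-- 1) Heap staging table: same columns and filegroup as the target.
CREATE TABLE dbo.AuditTable_Load (
    Col1 bigint NULL, Col2 int NULL, Col3 int NULL, Col4 int NULL,
    Col5 int NULL, Col6 money NULL, Modified datetime NULL,
    ModifiedBy varchar(50) NULL, ModifiedType char(1) NULL
) ON [PRIMARY];

-- 2) Minimally logged load of one month (TABLOCK into an empty heap,
--    simple recovery model).
INSERT INTO dbo.AuditTable_Load WITH (TABLOCK)
SELECT Col1, Col2, Col3, Col4, Col5, Col6, Modified, ModifiedBy, ModifiedType
FROM Audit.dbo.AuditTable
WHERE Modified >= '20150701' AND Modified < '20150801';

-- 3) Build the same indexes as the target so the switch is allowed.
CREATE CLUSTERED INDEX AuditTable_Load_Modified
    ON dbo.AuditTable_Load (Modified ASC) ON [PRIMARY];
CREATE NONCLUSTERED INDEX AuditTable_Load_Col1_Col2_Col3_Col4_Modified
    ON dbo.AuditTable_Load (Col1, Col2, Col3, Col4, Modified) ON [PRIMARY];

-- 4) Trusted check constraint proving every row belongs to the target
--    partition. The IS NOT NULL test matters: Modified is nullable, and
--    a CHECK constraint lets NULLs through otherwise.
ALTER TABLE dbo.AuditTable_Load WITH CHECK ADD CONSTRAINT CK_AuditTable_Load
    CHECK (Modified IS NOT NULL
           AND Modified >= '20150701' AND Modified < '20150801');

-- 5) Metadata-only switch into the matching partition
--    ($PARTITION.ThirteenMonthPartFunction('20150701') = 2 here).
ALTER TABLE dbo.AuditTable_Load
    SWITCH TO dbo.AuditTable PARTITION 2;

DROP TABLE dbo.AuditTable_Load;
GO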

The differences when compared to your original approach:

  • The methodology of moving the recent 12-13 months of data will be much more efficient if you load into a heap with TABLOCK one month at a time, using partition switching to place the data into the partitioned table.
  • A DELETE to clear the moved data out of the old table will be fully logged. Perhaps you can either TRUNCATE, or drop the table and create a new archive table (a sketch of the rebuild option follows this list).
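A sketch of the rebuild option, run in the now-archive database (the new table name is illustrative; SELECT...INTO is minimally logged under the simple recovery model, and the boundary matches the data that was moved to the new database):

-- Copy only the rows being kept into a new table, then swap the names.
SELECT *
INTO dbo.AuditTable_Rebuilt
FROM dbo.AuditTable
WHERE Modified < '20150601';   -- everything older than the moved data

EXEC sp_rename 'dbo.AuditTable', 'AuditTable_Old';
EXEC sp_rename 'dbo.AuditTable_Rebuilt', 'AuditTable';
DROP TABLE dbo.AuditTable_Old;

-- SELECT...INTO produces a heap, so rebuild the clustered index afterward
-- (a large but one-time operation on the retained rows).
CREATE CLUSTERED INDEX AuditTable_Modified
    ON dbo.AuditTable (Modified ASC);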

Comparison of approaches for moving the recent year of data

In order to compare the approaches in a reasonable amount of time on my machine, I used a 100 million row test data set that I generated and that follows your schema.

As you can see from the results below, there is a large performance boost and reduction in log writes by loading data into a heap using the TABLOCK hint. There is an additional benefit if this is done one partition at a time. It's also worth noting that the one-partition-at-a-time method can easily be parallelized further if you run multiple partitions at once. Depending on your hardware, that might yield a nice boost; we typically load at least four partitions at once on server-class hardware.

[Screenshot: elapsed time and transaction log writes for each load approach]

Here is the full test script.

Final notes

All of these results depend on your hardware to some degree. However, my tests were conducted on a standard quad-core laptop with a spinning disk drive. The data loads will likely be much faster if you are using a decent server that does not have a lot of other load at the time you run this process.

For example, I ran the recommended approach on an actual dev server (Dell R720) and saw a reduction to 76 seconds (from 156 seconds on my laptop). Interestingly, the original approach of inserting into a partitioned table did not experience the same improvement and still took just over 12 minutes on the dev server. Presumably this is because this pattern yields a serial execution plan, and a single processor on my laptop can match a single processor on the dev server.