For convenience, assume that the live database is called LiveDb and the archive database is called ArchiveDb.
- Add a UNION ALL view in LiveDb pointing to the tables in ArchiveDb via a synonym (there is no need to do a combined db with synonyms)
- "Partition" on visit.date and denormalise this column to visit_payments too if it isn't there already (this improves co-located join performance)
- Only archive the two large tables if possible (this reduces the chance of tripping up the optimiser). Keep the UNION ALL view and the other tables in LiveDb so all joins to the smaller tables are kept local
- Add a CHECK constraint on the tables in both LiveDb and ArchiveDb that describes the range of visit.date contained in the table. This helps the optimiser eliminate the archive table from both seeks and scans that contain the column visit.date. You will have to periodically update this constraint.
- In the UNION ALL view, add a WHERE clause which filters on visit.date. This is in addition to the hint you already provided in the CHECK constraint, and it maximises the chance of filters being pushed down
- If you have EE, partition the table in the archive database (But NOT in the live database). If you want to get really fancy, use filegroup level backup/restore of the archive databases to save on backup times.
- Consider putting ArchiveDb in SIMPLE recovery mode if it isn't already. You are not likely to need transaction log backups of ArchiveDb
- Use INSERT ... WITH (TABLOCK) SELECT ... WITH (ROWLOCK) to force minimal logging on the destination when moving data between LiveDb and ArchiveDb
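Sketched as T-SQL, the view, synonym, CHECK constraints, and minimally logged move above could look like the following. All table, column, and boundary values here are illustrative — in particular, this assumes the date column on visit is simply called [date] and that the live/archive split is at 2015-01-01:

```sql
-- In ArchiveDb: describe the range of dates the archive holds (illustrative boundary)
ALTER TABLE ArchiveDb.dbo.visit WITH CHECK
    ADD CONSTRAINT CK_visit_archive_date CHECK ([date] < '20150101');

-- In LiveDb: the complementary range
ALTER TABLE LiveDb.dbo.visit WITH CHECK
    ADD CONSTRAINT CK_visit_live_date CHECK ([date] >= '20150101');
GO

-- In LiveDb: a synonym pointing at the archived table
CREATE SYNONYM dbo.visit_archive FOR ArchiveDb.dbo.visit;
GO

-- The UNION ALL view, repeating the date filters so they can be pushed down
CREATE VIEW dbo.visit_all
AS
    SELECT * FROM dbo.visit         WHERE [date] >= '20150101'
    UNION ALL
    SELECT * FROM dbo.visit_archive WHERE [date] <  '20150101';
GO

-- Moving rows into the archive, with TABLOCK on the destination so the
-- insert can be minimally logged (ArchiveDb in SIMPLE recovery)
INSERT INTO ArchiveDb.dbo.visit WITH (TABLOCK)
SELECT * FROM LiveDb.dbo.visit WITH (ROWLOCK)
WHERE [date] < '20150101';
```

Note that the CHECK constraints and the WHERE clauses in the view must agree exactly on the boundary, otherwise rows could fall out of (or appear twice in) the view.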
All of the above does not guarantee that the optimiser will eliminate the archive tables from seeks and scans, but it makes it more likely.
When the elimination doesn't happen, these are the effects you may see (this list may be incomplete). For seeks, you will get an additional seek on every query (this drives up IOPS). For scans, the results could be disastrous for performance, as you may end up scanning both the archive and live tables. Here are the typical ways you can trip up the optimiser:
- If you join the visit% tables together and don't include visit.date in the join criteria (this is why you want to denormalise). Because of this, you may wish to modify some of your queries
- If you get a hash join between visit.date and another table (for example a date dimension), you may not get the right elimination of tables
- If you try to aggregate data over the archived tables
- If you filter on anything BUT visit.date, for example a seek directly on the key of the view.
For the last scenario, you can guard yourself against the worst effects by adding another CHECK constraint on cid, if this is possible. You did mention that the sequence of cid is not "clean" with respect to the dates and progression of rows in the table. However, could you maintain a table that contains the information "there are no cid above this number since this visit.date" or similar? This could then drive an additional constraint.
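If such a watermark is maintainable, the idea could be sketched like this (the watermark table, its columns, and the cid column name are all hypothetical):

```sql
-- Hypothetical watermark: the highest cid that exists in the archive,
-- maintained by the archive job
CREATE TABLE dbo.archive_watermark
(
    max_archived_cid BIGINT NOT NULL,
    as_of_date       DATE   NOT NULL
);

-- After each archive run: drop the old cid constraint (if any) and
-- rebuild it from the current watermark
DECLARE @max_cid BIGINT =
    (SELECT MAX(max_archived_cid) FROM dbo.archive_watermark);

DECLARE @ddl NVARCHAR(MAX) = N'
    ALTER TABLE ArchiveDb.dbo.visit WITH CHECK
        ADD CONSTRAINT CK_visit_archive_cid
        CHECK (cid <= ' + CAST(@max_cid AS NVARCHAR(20)) + N');';

EXEC sys.sp_executesql @ddl;
```

With that in place, a seek on cid alone has a chance of eliminating the archive table the same way the date constraint does.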
Another thing to be careful about is that parallel queries may spawn a LOT more threads once you query the partitioned view (as both "sub-tables" will be exposed to the same parallel optimisations). For that reason, you may want to limit MAXDOP on the server or on the queries that run in parallel.
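To cap parallelism per query rather than server-wide, a MAXDOP hint on the queries that touch the partitioned view works (the view and column names here are illustrative):

```sql
SELECT COUNT(*)
FROM dbo.visit_all               -- the UNION ALL view over live + archive
WHERE [date] >= '20160101'
OPTION (MAXDOP 4);               -- limit this query to 4 parallel threads
```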
By the way, if you know the queries well, you may not even need the same indexes in the two databases (this assumes you are 100% sure you will get the right elimination of tables). You could even consider using columnstores for ArchiveDb.
Back in 2007, I asked for an easy way to generate a CREATE TABLE script via T-SQL rather than using the UI or SMO. I was summarily rejected. However, SQL Server 2012 makes this very easy. Let's pretend we have a table with the same schema across multiple databases, e.g. dbo.whatcha:
CREATE TABLE dbo.whatcha
(
id INT IDENTITY(1,1),
x VARCHAR(MAX),
b DECIMAL(10,2),
y SYSNAME
);
The following script uses the new sys.dm_exec_describe_first_result_set dynamic management function to retrieve the proper data types for each of the columns (ignoring the IDENTITY property). It builds the #tmp table you need, inserts from each of the databases in your list, and then selects from #tmp, all within a single dynamic SQL batch and without using a WHILE loop (that doesn't make it better, just simpler to look at, and it allows you to ignore Database_Ref_No entirely :-)).
SET NOCOUNT ON;
DECLARE @sql NVARCHAR(MAX), @cols NVARCHAR(MAX) = N'';
SELECT @cols += N',' + name + ' ' + system_type_name
FROM sys.dm_exec_describe_first_result_set(N'SELECT * FROM dbo.whatcha', NULL, 1);
SET @cols = STUFF(@cols, 1, 1, N'');
SET @sql = N'CREATE TABLE #tmp(' + @cols + ');'
DECLARE @dbs TABLE(db SYSNAME);
INSERT @dbs VALUES(N'db1'),(N'db2');
-- SELECT whatever FROM dbo.databases
SELECT @sql += N'
INSERT #tmp SELECT ' + @cols + ' FROM ' + QUOTENAME(db) + '.dbo.tablename;'
FROM @dbs;
SET @sql += N'
SELECT ' + @cols + ' FROM #tmp;';
PRINT @sql;
-- EXEC sp_executesql @sql;
The resulting PRINT output:
CREATE TABLE #tmp(id int,x varchar(max),b decimal(10,2),y nvarchar(128));
INSERT #tmp SELECT id,x,b,y FROM [db1].dbo.tablename;
INSERT #tmp SELECT id,x,b,y FROM [db2].dbo.tablename;
SELECT id,x,b,y FROM #tmp;
When you are confident it's doing what you expect, just uncomment the EXEC.
(This trusts that the schema really is the same; it does not validate that one or more of the tables has since been changed, and may fail as a result.)
Best Answer
Since you only have to do it once, forget the fancy batch job and just get on with each task. Assuming you know the table names, you just do this sort of thing.
For each table with foreign keys, drop the foreign key constraint on that field. Then depending on your requirement, you can add records to the table that should have the data and add a new foreign key, or, simply delete those records. Then you can drop the table.
Much of this work can be done with SSMS so you might not have to write that many queries.
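For each table, the sequence boils down to something like the following (constraint, table, and column names are placeholders — script the real ones out of SSMS):

```sql
-- 1. Drop the foreign key constraint on the referencing table
ALTER TABLE dbo.child_table DROP CONSTRAINT FK_child_parent;

-- 2. Depending on your requirement: either copy the rows you want to keep
--    into the surviving table and add a new foreign key there, or simply
--    delete the records that referenced the table being removed
DELETE FROM dbo.child_table
WHERE parent_id IN (SELECT id FROM dbo.parent_table);

-- 3. The table is now unreferenced and can be dropped
DROP TABLE dbo.parent_table;
```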