Sql-server – SQL Server database design for “archived but available” data

database-designsql serversql-server-2012

We have this large database (>1TB) that we intend to "shrink". The database revolves around one main entity, let's call it "Visit". For discussion, let's say it is a database for a medical practice.

There are a total of 30 visit "types", such as procedure, annual, follow-up, immunisation etc, each of which is a subsidary table to "Visit", e.g. "visit_immuno".

The database has accummulated some 12 years of data since 2000. Someone has proposed that we keep about 3 years of data in the "live" version and have the rest live in an "old_data" database. The date is ONLY stored in the "Visit" table since it is normalised. The Visit table also contains a ROWVERSION column and a BIGINT pseudo-identity (clustered) column. For all intents and purposes, let's say the clustering key is populated by a SEQUENCE (SQL Server 2012 Enterprise) – we shall name it cid.

The visit.date is not always in the same order as the clustering key, for example when a doctor goes on extended visitations and returns with his "briefcase" of data, it gets merged into the main table. There are also some updates to the "visit" table that will cause the ROWVERSION column to be out of sync with both the cid and date columns – to put it simply, neither ROWVERSION nor cid would make suitable partition keys for this reason.

The business rule for removing data from the "live" is that the visit.date must be greater than 36 months and a child visit_payment record must exist. Also, the "old_data" database does not contain any of the base tables except visit%.

So we end up with:

Live DB (daily use) – All tables
Old-Data DB – older data for the visit% tables

The proposal calls for a Combined DB that is a shell containing Synonyms to ALL the base tables in the Live DB (except visit%) plus Views that UNION ALL across the visit% tables in the two databases.

Assuming the same indexes are created in the Old-Data DB, will the queries perform well on the UNION-ALL Views? What type of query patterns might trip up the execution plan for the UNION-ALL Views?

Best Answer

For convenience, assume that the live database is called LiveDb and the achive database is called ArchiveDb

  • Add a UNION ALL view in LiveDb pointing to the tables in ArchiveDb database via a synonym (There is no need to do a combined db with synonyms)
  • "Partition" on visit.date and denormalise this column to visit_payments too if it isn't there already (this improves co-located join performance)
  • Only archive the two large tables if possible (reduces chance of tripping up the optimiser). Keep the UNION ALL view and the other tables in LiveDb so all joins to the smaller tables are kept local
  • Add a CHECK constraint on the tables in both LiveDb and ArchiveDb that describes the range of visit.date contained in the table. This helps the optimiser eliminate the archive table from both seeks and scans that contain the column visit.data. You will have to periodically update this constraint.
  • In the UNION ALL view, add a WHERE criteria which filters on visit.data. This is in addition to the hint you already provided in the check constraint. This maximises the chance of filters being pushed down
  • If you have EE, partition the table in the archive database (But NOT in the live database). If you want to get really fancy, use filegroup level backup/restore of the archive databases to save on backup times.
  • Consider putting AchiveDb in SIMPLE recovery mode if it isn't already. You are not likely to need transaction log backups of ArchiveDb
  • Use INSERT... WITH (TABLOCK) SELECT ... WITH (ROWLOCK) to force minimal logging on the destination when moving data between LiveDb and ArchiveDb

All of the above does not guarantee that the optimiser will eliminate the archive tables from seeks and scans, but it makes it more likely.

When the elimination doesn't happen. These are the effect you may see (this list may be incomplete). For seeks, you will get an additional seek on every query (this drives up IOPS). For scans, the results could be disastrous for performance as you may end up scanning both the archive and live tables. Here are the typical ways you can trip up the optimiser:

  • If you join the visit% tables together and don't include the visit.data in the join criteria (this is why you want to denormalise). Because of this, you may wish to modify some of your queries
  • If you get a hash join between visit.data and another table (for example a date dimension), you may not get the right elimination of tables
  • If you try to aggregate data over the archived tables
  • If you filter on anything BUT the visit.data, for example a seek directly on the key of the view.

For the last scenario, you can guard yourself against the worst effects by adding another check constraint on the cid - if this is possible. You did mention that the sequence of cid not "clean" with respect to the dates and progression of rows in the table. However, could you maintain a table that contains the information: "There are no cid above this number since this visit.data" or similar? This could then drive an additional constraint.

Another thing to be careful about is that parallel queries may spawn a LOT more threads once you query the partitioned view (as both "sub-tables" will be exposed to the same parallel optimisations). For that reasons, you may want to limit MAXDOP on the server or the queries that are parallel.

By the way, if you know the queries well - you may not even need the same indexes in the two databases (this assumes you are 100% sure you will get the right elimination of tables). You could even consider using column stores for ArchiveDb.