We have this large database (>1TB) that we intend to "shrink". The database revolves around one main entity, let's call it "Visit". For discussion, let's say it is a database for a medical practice.
There are a total of 30 visit "types", such as procedure, annual, follow-up, immunisation etc, each of which is a subsidary table to "Visit", e.g. "visit_immuno".
The database has accummulated some 12 years of data since 2000. Someone has proposed that we keep about 3 years of data in the "live" version and have the rest live in an "old_data" database. The date is ONLY stored in the "Visit" table since it is normalised. The Visit table also contains a ROWVERSION
column and a BIGINT
pseudo-identity (clustered) column. For all intents and purposes, let's say the clustering key is populated by a SEQUENCE (SQL Server 2012 Enterprise) – we shall name it cid
.
The visit.date
is not always in the same order as the clustering key, for example when a doctor goes on extended visitations and returns with his "briefcase" of data, it gets merged into the main table. There are also some updates to the "visit" table that will cause the ROWVERSION
column to be out of sync with both the cid
and date
columns – to put it simply, neither ROWVERSION
nor cid
would make suitable partition keys for this reason.
The business rule for removing data from the "live" is that the visit.date
must be greater than 36 months and a child visit_payment
record must exist. Also, the "old_data" database does not contain any of the base tables except visit%
.
So we end up with:
Live DB (daily use) – All tables
Old-Data DB – older data for the visit%
tables
The proposal calls for a Combined DB that is a shell containing Synonyms to ALL the base tables in the Live DB
(except visit%
) plus Views that UNION ALL across the visit%
tables in the two databases.
Assuming the same indexes are created in the Old-Data
DB, will the queries perform well on the UNION-ALL Views? What type of query patterns might trip up the execution plan for the UNION-ALL Views?
Best Answer
For convenience, assume that the live database is called
LiveDb
and the achive database is calledArchiveDb
LiveDb
pointing to the tables inArchiveDb
database via a synonym (There is no need to do a combined db with synonyms)visit.date
and denormalise this column tovisit_payments
too if it isn't there already (this improves co-located join performance)LiveDb
so all joins to the smaller tables are kept localLiveDb
andArchiveDb
that describes the range ofvisit.date
contained in the table. This helps the optimiser eliminate the archive table from both seeks and scans that contain the columnvisit.data
. You will have to periodically update this constraint.visit.data
. This is in addition to the hint you already provided in the check constraint. This maximises the chance of filters being pushed downAchiveDb
in SIMPLE recovery mode if it isn't already. You are not likely to need transaction log backups ofArchiveDb
LiveDb
andArchiveDb
All of the above does not guarantee that the optimiser will eliminate the archive tables from seeks and scans, but it makes it more likely.
When the elimination doesn't happen. These are the effect you may see (this list may be incomplete). For seeks, you will get an additional seek on every query (this drives up IOPS). For scans, the results could be disastrous for performance as you may end up scanning both the archive and live tables. Here are the typical ways you can trip up the optimiser:
visit%
tables together and don't include thevisit.data
in the join criteria (this is why you want to denormalise). Because of this, you may wish to modify some of your queriesvisit.data
and another table (for example a date dimension), you may not get the right elimination of tablesvisit.data
, for example a seek directly on the key of the view.For the last scenario, you can guard yourself against the worst effects by adding another check constraint on the
cid
- if this is possible. You did mention that the sequence ofcid
not "clean" with respect to the dates and progression of rows in the table. However, could you maintain a table that contains the information: "There are nocid
above this number since thisvisit.data
" or similar? This could then drive an additional constraint.Another thing to be careful about is that parallel queries may spawn a LOT more threads once you query the partitioned view (as both "sub-tables" will be exposed to the same parallel optimisations). For that reasons, you may want to limit MAXDOP on the server or the queries that are parallel.
By the way, if you know the queries well - you may not even need the same indexes in the two databases (this assumes you are 100% sure you will get the right elimination of tables). You could even consider using column stores for
ArchiveDb
.