Data archival strategy that handles schema changes

archive, best-practices

I am working with a legacy application that has about ten years of customer data. Most of this data is not used in day-to-day operations, but there is a business requirement to keep it available until the customer is retired from the system.

We are exploring archiving the data to a copy of the existing database and then purging records older than a certain cutoff from production.
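Roughly, the idea is something like the following (a simplified sketch only; the archive database name, the table, the ClosedDate column, and the seven-year cutoff are placeholders for illustration):

```sql
-- Copy old rows into the archive database, then purge them from production.
DECLARE @Cutoff date = DATEADD(YEAR, -7, GETDATE());

BEGIN TRANSACTION;

INSERT INTO ArchiveDB.dbo.CustomerOrder (OrderId, CustomerId, ClosedDate /* , ... */)
SELECT OrderId, CustomerId, ClosedDate /* , ... */
FROM dbo.CustomerOrder
WHERE ClosedDate < @Cutoff;

DELETE FROM dbo.CustomerOrder
WHERE ClosedDate < @Cutoff;

COMMIT TRANSACTION;
```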

My concern is that the database undergoes a substantial schema change every quarter due to development efforts.

If I were to archive a mirror copy of the data, would I also need to apply every single change script that goes against production?

Are there any alternative strategies? It seems that no matter what form of storage you choose (e.g., database, flat files, XML), you will always need some way of mapping older schemas to newer ones.

Best Answer

You need to define your requirements more specifically before even thinking about a solution:

  • Why is archiving necessary? It sounds like the system already handles old data, so what is the business need to separate out this data? Performance?

  • Is archive data a read-only snapshot, or are historical data changes possible? If changes are possible, which types of changes will be supported (insert, update, delete, or some combination of those)?

  • Is the database multi-tenant, and if so, do you need each tenant to be able to archive at a different point in time?

  • Does your application need to run with the archive data as a data source? I'm assuming yes since you mention synchronizing schema changes.

  • What is the minimum version/edition of the DBMS you need to support? This will determine which features are available to use in your strategy.

  • How much time do you have to implement archiving? Archiving is a low-level design issue that ideally should be baked in right from the start; bolting it on later can take a significant amount of time to redesign and implement.


Having said all that, one piece of advice from having done this here (on SQL Server): avoid multiple databases if at all possible, particularly if you need the ability to edit historical data.

  • While I can't give away our IP, I'll tell you that the process involved in shuffling data from the "live" database to the "archive" database is extremely complex if you take a dynamic approach like we did for ~700 tables (see the sketch after this list for what "dynamic" means here). Depending on the state of your database schema, that kind of thing may not even be possible to accomplish, or it may result in data discrepancies you don't expect. If you don't have very many tables (< 200) and the schema is in rough shape, quite honestly, do not take a dynamic approach, or wait until the schema has been cleaned up to a decent state. With a large number of tables and a rough schema, multiple databases are not a viable solution.

  • As you mentioned, you do have to run update scripts against multiple databases, and you have to keep track of all those databases somehow. It's really easy for things to become desynchronized, e.g., an archive database is moved to a different server or instance while your configuration database or table still holds the old information. The choice is either to always synchronize the schema or to write applications that are always backwards compatible. (Hint: it's far easier to synchronize the schema; a sketch of doing that across registered archive databases appears at the end of this answer.)
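To illustrate what a "dynamic" approach means here (this is only a rough sketch, not our actual process; ArchiveDB and the assumption that every table carries an ArchivedFlag column are placeholders), the data movement is generated from metadata rather than written by hand for each table:

```sql
-- Rough sketch of a metadata-driven ("dynamic") move; placeholder names throughout.
-- A real implementation also has to handle identity columns, computed columns,
-- foreign-key ordering, and tables whose shape differs between the two databases.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql
    + N'INSERT INTO ArchiveDB.' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name)
    + N' SELECT * FROM ' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name)
    + N' WHERE ArchivedFlag = 1;' + CHAR(13)
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id;

EXEC sys.sp_executesql @sql;
```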

While I certainly don't recommend multiple databases, it is a possible solution depending on your requirements.
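If you do go that route, the schema-synchronization side is conceptually simple: keep a registry of archive databases and apply every production change script to each of them. A minimal sketch, assuming a hypothetical dbo.ArchiveDatabase registry table and an example ALTER standing in for the real change script:

```sql
-- Apply the same schema change script to every registered archive database
-- so they never drift from production. Names here are placeholders.
DECLARE @db sysname, @sql nvarchar(max);

DECLARE db_cursor CURSOR FOR
    SELECT DatabaseName FROM dbo.ArchiveDatabase WHERE IsActive = 1;

OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @db;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Run the same change script that was applied to production.
    SET @sql = N'USE ' + QUOTENAME(@db) + N'; '
             + N'ALTER TABLE dbo.CustomerOrder ADD LoyaltyTier tinyint NULL;';
    EXEC sys.sp_executesql @sql;

    FETCH NEXT FROM db_cursor INTO @db;
END;

CLOSE db_cursor;
DEALLOCATE db_cursor;
```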