Data archival strategy that handles schema changes

archive, best-practices

I am working with a legacy application that has about ten years of customer data. Most of this data is not used in day-to-day operations, but there is a business requirement to keep it available until the customer is retired from the system.

We are exploring archiving the data to a copy of the existing database and then purging records older than a certain cutoff from production.
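Roughly, the idea is something like the following (a simplified sketch only; the archive database name, the table, the ClosedDate column, and the seven-year cutoff are placeholders for illustration):

```sql
-- Copy old rows into the archive database, then purge them from production.
DECLARE @Cutoff date = DATEADD(YEAR, -7, GETDATE());

BEGIN TRANSACTION;

INSERT INTO ArchiveDB.dbo.CustomerOrder (OrderId, CustomerId, ClosedDate /* , ... */)
SELECT OrderId, CustomerId, ClosedDate /* , ... */
FROM dbo.CustomerOrder
WHERE ClosedDate < @Cutoff;

DELETE FROM dbo.CustomerOrder
WHERE ClosedDate < @Cutoff;

COMMIT TRANSACTION;
```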

My concern is that the database undergoes a substantial schema change every quarter due to development efforts.

If I were to archive a mirror copy of the data, would I also need to apply every single change script that goes against production?

Are there any alternative strategies? It seems that no matter what form of storage you choose (e.g., database, flat files, XML), you will always need some way of mapping older schemas to newer ones.

Best Answer

You need to define your requirements more specifically before even thinking about a solution:

  • Why is archiving necessary? It sounds like the system already handles old data, so what is the business need to separate out this data? Performance?

  • Is archive data a read-only snapshot, or are historical data changes possible? If changes are possible, which types of changes will be supported (insert, update, delete, or some combination of those)?

  • Is the database multi-tenant, and if so, do you need each tenant to be able to archive at a different point in time?

  • Does your application need to run with the archive data as a data source? I'm assuming yes since you mention synchronizing schema changes.

  • What is the minimum version/edition of the DBMS you need to support? This will determine which features are available to use in your strategy.

  • How much time do you have to implement archiving? Archiving is a low-level design issue that ideally should be baked in right from the start; bolting it on later can take a significant amount of time to redesign and implement.


Having said all that, one piece of advice from having done this here (on SQL Server): avoid multiple databases if at all possible, particularly if you need the ability to edit historical data.

  • While I can't give away our IP, I'll tell you that the process involved in shuffling data from the "live" database to the "archive" database is extremely complex if you take a dynamic approach like we did for ~700 tables (see the sketch after this list for what "dynamic" means here). Depending on the state of your database schema, that kind of thing may not even be possible to accomplish, or it may result in data discrepancies you don't expect. If you don't have very many tables (< 200) and the schema is in rough shape, quite honestly, do not take a dynamic approach, or wait until the schema has been cleaned up to a decent state. With a large number of tables and a rough schema, multiple databases are not a viable solution.

  • As you mentioned, you do have to run update scripts against multiple databases, and you have to keep track of all those databases somehow. It's really easy for things to become desynchronized, e.g., an archive database is moved to a different server or instance while your configuration database or table still holds the old information. The choice is either to always synchronize the schema or to write applications that are always backwards compatible. (Hint: it's far easier to synchronize the schema; a sketch of doing that across registered archive databases appears at the end of this answer.)
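To illustrate what a "dynamic" approach means here (this is only a rough sketch, not our actual process; ArchiveDB and the assumption that every table carries an ArchivedFlag column are placeholders), the data movement is generated from metadata rather than written by hand for each table:

```sql
-- Rough sketch of a metadata-driven ("dynamic") move; placeholder names throughout.
-- A real implementation also has to handle identity columns, computed columns,
-- foreign-key ordering, and tables whose shape differs between the two databases.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql
    + N'INSERT INTO ArchiveDB.' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name)
    + N' SELECT * FROM ' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name)
    + N' WHERE ArchivedFlag = 1;' + CHAR(13)
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id;

EXEC sys.sp_executesql @sql;
```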

While I certainly don't recommend multiple databases, it is a possible solution depending on your requirements.
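If you do go that route, the schema-synchronization side is conceptually simple: keep a registry of archive databases and apply every production change script to each of them. A minimal sketch, assuming a hypothetical dbo.ArchiveDatabase registry table and an example ALTER standing in for the real change script:

```sql
-- Apply the same schema change script to every registered archive database
-- so they never drift from production. Names here are placeholders.
DECLARE @db sysname, @sql nvarchar(max);

DECLARE db_cursor CURSOR FOR
    SELECT DatabaseName FROM dbo.ArchiveDatabase WHERE IsActive = 1;

OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @db;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Run the same change script that was applied to production.
    SET @sql = N'USE ' + QUOTENAME(@db) + N'; '
             + N'ALTER TABLE dbo.CustomerOrder ADD LoyaltyTier tinyint NULL;';
    EXEC sys.sp_executesql @sql;

    FETCH NEXT FROM db_cursor INTO @db;
END;

CLOSE db_cursor;
DEALLOCATE db_cursor;
```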