SQL Server – 24×7 vs Night Time Window

sql server

Where can I find resources on how to better move to a 24×7 operation? How do the big companies with big databases accomplish this? Our nightly jobs such as

  1. purge old data
  2. reindex
  3. update statistics

all seem to cause critical impact to our system (i.e. online users and real time data feeds). I've looked on Amazon for any book related to this subject, and so far haven't found anything.

Best Answer

Maintaining 24x7 databases is a fairly large topic with lots of options to consider, but we can touch on some of the high points.

The first thing to identify is that while many operations are 24x7, there are usually periods of low activity. You can schedule your maintenance during these windows to reduce its interference with the database. Second, you will still need to reserve some time for complete outages (for things like service packs or database migrations), so negotiate full maintenance windows with your management. Each item below needs to be considered and planned for individually, leveraging your tools appropriately. The important piece is that you must PLAN each of these; any examples I provide are very much "your mileage may vary".

Backups

Backups commonly won't have a huge impact on workloads, but they must be accounted for because they can consume a lot of I/O. Schedule them appropriately and monitor how long they take to complete. The biggest hurdle is that in a 24x7 operation, you likely won't be able to take a full backup every night of the week. Plan out when you can take fulls, when you will take differentials, and the retention periods for both, in combination with your log backups.

As an example, I run full backups of all my databases on Sunday night (lowest activity), differentials on all other nights (Monday-Saturday). I keep the last two weeks of fulls and diffs on disk, logs for the last two days. This gives me enough flexibility for recovery, but I might have to recover backups from tape if necessary.
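A schedule like that is just a set of scheduled jobs running standard backup commands. A minimal sketch (the database name and paths are placeholders; adjust for your environment):

```sql
-- Sunday night: weekly full backup
BACKUP DATABASE [SalesDB]
    TO DISK = N'X:\Backups\SalesDB_FULL.bak'
    WITH CHECKSUM, COMPRESSION;

-- Monday through Saturday nights: differential backup
BACKUP DATABASE [SalesDB]
    TO DISK = N'X:\Backups\SalesDB_DIFF.bak'
    WITH DIFFERENTIAL, CHECKSUM, COMPRESSION;

-- Throughout the day (e.g. every 15 minutes): log backup
BACKUP LOG [SalesDB]
    TO DISK = N'X:\Backups\SalesDB_LOG.trn'
    WITH CHECKSUM, COMPRESSION;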

Index/Statistics Maintenance

This is the most common type of active maintenance you will have to deal with. You can't avoid it, but you can mitigate the impact. The initial rule of thumb is to only do maintenance on objects that need it: general guidelines are to only rebuild indexes that are more than 30% fragmented and larger than 1000 pages. If you have auto-update statistics enabled, it will handle most of your statistics maintenance, but a nightly job to keep things in sync isn't a bad idea.
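You can find the indexes that meet those thresholds with `sys.dm_db_index_physical_stats`, for example:

```sql
-- Indexes in the current database that are over 30% fragmented
-- and larger than 1000 pages
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30
  AND ips.page_count > 1000
ORDER BY ips.avg_fragmentation_in_percent DESC;
```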

If you have Enterprise Edition, you also have access to some other options for managing maintenance. Foremost is Online Index Rebuilds, which will allow you to rebuild indexes while they are still in use (essentially, it builds the index side by side, then swaps it in). You also can leverage partitioning for "large" tables to reduce the amount of rebuild time necessary.
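An online rebuild is just an option on the normal rebuild statement (the index and table names here are illustrative):

```sql
-- Enterprise Edition only: rebuild while the index stays available,
-- using tempdb for the intermediate sort
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders
    REBUILD WITH (ONLINE = ON, SORT_IN_TEMPDB = ON);
```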

Your best bet for this type of maintenance, if you don't have custom scripts that implement these best practices, is Ola Hallengren's maintenance scripts. They are fairly easy to set up and configure, and have many of these guidelines built in.
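Once installed, you schedule a call to his `IndexOptimize` procedure as an Agent job. A sketch using the documented parameters (the thresholds shown are the script's defaults; check the current documentation, as parameters may change between versions):

```sql
-- Reorganize between 5% and 30% fragmentation, rebuild above 30%,
-- preferring online rebuilds, and update statistics afterwards
EXECUTE dbo.IndexOptimize
    @Databases = 'USER_DATABASES',
    @FragmentationLow = NULL,
    @FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
    @FragmentationHigh = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
    @FragmentationLevel1 = 5,
    @FragmentationLevel2 = 30,
    @UpdateStatistics = 'ALL';
```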

DBCC Consistency Checks

Depending on your overall workload, you might find DBCC checks to be too disruptive for your operation. There are two common ways to minimize the impact of DBCC checks on your databases:

  • PHYSICAL_ONLY - Running this option will check your databases at a physical page level and avoid the more invasive full check. This will cover identifying the most likely types of corruption.
  • Checking a restored copy - If you have the space, you can restore the database to another instance and run a DBCC check against the restored copy. This tells the same tale about your live database, but you obviously won't interfere with its activity. Other alternatives include running DBCC against a log shipped copy or a mirrored database.
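Both approaches look like this in practice (database names and file paths are placeholders):

```sql
-- Option 1: lightweight check on the live database
DBCC CHECKDB (N'SalesDB') WITH PHYSICAL_ONLY, NO_INFOMSGS;

-- Option 2: restore last night's full backup to another instance
-- and run the complete check there
RESTORE DATABASE [SalesDB_Check]
    FROM DISK = N'X:\Backups\SalesDB_FULL.bak'
    WITH MOVE N'SalesDB'     TO N'Y:\Data\SalesDB_Check.mdf',
         MOVE N'SalesDB_log' TO N'Y:\Data\SalesDB_Check.ldf';

DBCC CHECKDB (N'SalesDB_Check') WITH NO_INFOMSGS;
```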

This blog post provides more detail on your options.

Batch jobs/ETL

This really comes down to how you design your processes. Your ETL can interfere with live OLTP tables just like any other application, so keep a few things in mind:

  • Schedule such work around your other maintenance and in low activity periods.
  • Right-size the work: batch it so each unit performs well, but no batch is so large that it locks your table for hours. The two ends of the spectrum to avoid are row-by-agonizing-row (RBAR) processing and a single million-row delete.
  • Use stage tables and offline your data processing where appropriate. Only touch the live stuff when absolutely necessary.
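For the purge job in particular, the batching point above can be sketched as a loop that deletes small chunks so each statement holds only short-lived locks (the table name, batch size, and retention period are illustrative):

```sql
-- Purge rows older than six months in batches of 5000
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (5000) FROM dbo.AuditLog
    WHERE CreatedDate < DATEADD(MONTH, -6, GETDATE());

    SET @rows = @@ROWCOUNT;

    -- Optionally yield to other workloads between batches:
    -- WAITFOR DELAY '00:00:01';
END;
```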

Conclusion

Again, there's a lot of ground to cover here. This is not a comprehensive guide, but a high level overview of some approaches. I haven't even discussed high-availability options (such as Availability Groups and Failover Clustering). You will need to review each item and build out a plan for how to handle it. In many ways, you will also need to iterate and refine your work as you move forward.

Additional resources:

SQL Skills VLDB maintenance best practices