Monitoring Maintenance Tasks in AlwaysOn Clusters

availability-groupsmaintenancemonitoring

I have an environment with multiple AGs on a single server and they encounter a split situation in which some end up on one node and some end up on the other node (in a failover event, perhaps). When maintenance tasks (the basic gamut of reindexing, update, etc.) execute on these nodes, they will only effectively perform operations on databases which are read_write on the primary side of their respective AGs. This causes the job to fail since the databases are read_only on the secondaries. Therefore, in a split AG situation, both nodes will always show failed maintenance in the job history.

Needless to say, it is difficult to monitor these clusters. I have proposed two possible solutions to management. We either script the maintenance tasks using DBCC commands for the individual databases based on their updateability properties, or we simply limit ourselves to one AG per server. They don't like either solution. Does anyone know of a way I can keep the built-in SQL Server maintenance tasks and have them target only the primary databases?

Best Answer

SQL Server Maintenance Plans do not have in-built functionality to detect the AG replica state for a particular DB and act accordingly. To do this, you must implement custom functionality.

Depending on how you're implementing your maintenance tasks, there are a few options for this.

Option #1:

If you're using SQL Maintenance Plans, then create one set of maintenance plans per availability group and target only the DBs in that group. Remove the schedules for these plans and then create a SQL Agent job that runs on the schedule you want. Have that Agent job check sys.dm_hadr_availability_replica_states to confirm the AG in question is in the primary state and if so, execute the appropriate Agent jobs to run the maintenance plan tasks.

Option #2:

If you're using SQL Agent jobs with T-SQL scripts to perform your maintenance tasks, add in some logic to your scripts to check sys.dm_hadr_database_replica_states prior to executing to validate that the database in question is indeed the primary.

Option #3:

Redesign your maintenance processes to utilise Ola Hallengren's excellent maintenance solution. This has the logic for handling Availability Groups already built in, so this saves you time in designing, testing and deploying a solution yourself. You simply deploy an identical set of Agent jobs to run the stored procedures on every replica. They will all execute, but only the ones you want to perform any work will do so.

Notes:

  • Some tasks in Ola's solution run on all replicas all the time, such as Integrity Checks and server-level maintenance (log cycling etc). These should be run on all replicas (yes, even integrity checks), so you shouldn't be concerned about this.
  • Ola's solution checks the preferred backup replica using sys.fn_hadr_backup_is_preferred_replica before backing up, so make sure your backup preferences are correctly set on your AGs.
  • Maintenance Plans, in general, are not recommended by most DBAs because of the inflexibility they often impose. T-SQL or PowerShell based solutions are usually preferred as they're much more robust.
  • Options #2 and #3 are essentially the same, but option #3 means someone else has done the hard work for you. Additionally, the widespread use of this solution means there is lots of help and advice in the SQL community if you run into issues.

Any of these options should be deployed in a fashion where the same jobs are deployed on all replicas and run at the same time. The only differences being that some tasks are only performed on the primary replica, while others just run the job and exit quietly performing no work because of the replica state.

Your management team may not like the idea of refactoring the current maintenance processes, but the reality is they have to if they want to automate this to handle failover, otherwise, the maintenance tasks have to be managed manually post-failover.