Sql-server – SQL Agent jobs in Basic Availability Group setup

availability-groupssql server

I have the following scenario on SQL Server 2016 Standard Edition 13.0.4411.0 (SP1-CU1):

2 database servers (DBNode1 and DBNode2). I have a database in a Basic Availability Group, with DBNode1 being the Primary and DBNode2 being the Secondary. As expected, if I try connecting to the DB on DBNode2, it fails as expected with the "The target database, …, is participating in an availability group and is currently not accessible for queries" error

I've created a SQL Server Agent Job on DBNode2 only (the Secondary) to perform a dummy UPDATE on the database every minute as a test and expected this to fail as that DB should not be accessible on that Secondary node.

UPDATE dbo.SomeTable SET Field1=GETDATE() WHERE Id = 1

The "Database" option for the job step is set to my DB.
However, it succeeds and has actually performed the operation on the Primary node. (See images 1 and 2 below)

If I change the Job to target the master database, and use the following TSQL command structure, this does then error as I'd expect, being the Secondary node (See images 3 and 4 below).

UPDATE MyDB.dbo.SomeTable SET Field1=GETDATE() WHERE Id = 1

It seems like behind the scenes, despite targetting the local server, this is actually working out that the target DB is in an Availability Group and so connects to the Primary node and runs the command there. Is this normal / documented behaviour?

UPDATE:

Added screenshots to show the setup, to avoid confusion.

1 – original job setup showing the unexpected behaviour

2 – original job history output, showing the successful execution

3 – revised job setup showing the expected behaviour

4 – revised job history output, showing the failed execution as expected

Best Answer

Sadly, I'm going to argue that you're just 'scratching the surface' in terms of some of the complexity associated with SQL Server Agent Jobs and some of the complexities you face with Availability Groups (AGs). And, I say "sadly" because Microsoft has actually done a VERY poor job of addressing these concerns to the point where there's virtually no guidance from them.

So, for example - assume you get this working as expected and then you've got a job set to run at the same time on both servers and it'll automatically run on the current/correct replica (depending upon which is the primary) - great/perfect. Until someone comes in and disables the job or changes the schedule - but only makes that change on ONE server (not both) - and then there's a failover. Now what? (Or, in other words, if it's important that a job run (or NOT run) then it's important that whatever changes are made to the job should be synchronized between one server and the next (in a 2-node AG) and, yet, it's incredibly easy to let jobs like this get out of sync - even when there's only 1 'DBA' managing the servers.

As such, I spewed out a set of blog posts on all of this a while back: http://sqlmag.com/blog/alwayson-availability-groups-and-sql-server-jobs-part-27-options-and-concerns-more-advanced-dep

The blog posts are VERY MUCH a 'stream of consciousness' sort of affair - and there are a couple of mis-steps in the posts and things are LESS insightful than I'd like. Still, by skimming through those posts, you can gain some great insights (i think) on some of the problems to expect and OPTIONS for working around those (with various pros/cons configured).

Eventually, I'll rewrite that set of posts into something more concise (and more 'wise' - i.e., that benefits from the number of AGs I've worked with since those posts - and all of the changes I've made). That said, here's a rough overview of what I do now to address these details:

I fully use the UDFs that determine whether the replica is the principal or not (either based on db-name or ag-name).
Rather than use the 'job state table' I outlined (as that's too clunky), jobs are either a) run as 'server-level' jobs (backups, server event logs/history maintenance/etc.) or b) assigned a 'disabled category' - meaning the job is/should be disabled (and then my code checks to make sure 'disabled' jobs are disabled and so on) or c) set to a SQL Server Agent Job category where the NAME of the category = the name of the AG they're in - so that when failover happens, code traps said failover and responds on the new primary by enabling all jobs of categoryname = 'ag-name' and doing the opposite on the secondary (if it's still up).
Still running synchronization checks (vs server-level details and jobs) every x minutes - to catch situations where something changes on one server - but not the other. (I don't bother letting CODE try and sync the differences - instead, it just reports that something is different.)

Best Answer

Related Solutions

Scheduling Backup Jobs for an Availability Group in SQL Server 2014

Sql-server – Creating SQL Server Job on each replica in availability group

Related Question