We have been asked to report on the health of our SQL jobs. I am having trouble understanding the run_status
column in dbo.sysjobhistory
. The MSDB page for sysjobhistory gives this description:
Status of the job execution:
- 0 = Failed
- 1 = Succeeded
- 2 = Retry
- 3 = Canceled
What events cause each status code?
Here is what I have done to attempt to produce each status code. I was able to reproduce…
- 0 (Failed) by writing SQL that throws an error.
- For example, step 1 is
SELECT SELECT
- For example, step 1 is
- 1 (Succeeded) by writing SQL that succeeds.
- For example, step 1 is
SELECT 1
- For example, step 1 is
- 3 (Cancelled) by stopping a job while it was running
- 4 (In progress) by querying sysjobshistory while a job was running
I know a way produce 0's, 1's, 3's, and 4's, but I don't know all of the ways to produce them, and I have no idea how to cause a 2.
The health of our jobs is binary. We are either healthy or not healthy, because the end user does not want to see anything more granular than that on their dashboard. That is why I am asking this question – I have to interpret the status codes and run durations and draw a single, binary conclusion about the health of the SQL jobs. The problem is that MSDN doesn't explain what the status codes mean, or more importantly, what events might cause them.
For example, if a user cancels a job (resulting in a run_status of 3), we might decide that we don't care and just report that we are healthy. However, if a job gets cancelled because of some other event, we might care about that.
Best Answer
Create a job with a single step. On the advanced tab for that step, change retry attempts from 0 to something greater than 0. In the job step, have it be your
SELECT SELECT
. Run the job. You will see an entry .I don't think anyone could provide a full list of all scenarios that could lead to the various possible status values. Did you try anything like this while a job was running (and it's easy to test this with a simple job that has a loop with
SELECT TOP (1) name FROM somedb.sys.all_objects;WAITFOR DELAY '00:00:01';
); things like:somedb
tosingle_user
oroffline
run_status
is useful, but you can get a better picture from the combination ofrun_status
andmessage
(and, in severe cases, the SQL Server error log). The trouble withmessage
is that the contents can get truncated, so if you have a step that performs 400 successful backups and then one fails, the content will be full of successful messages, and the failure will be missing. For important jobs you may want to log to a table or to a file; you can see a variety of options on the Advanced tab of any job step.