We have several SQL Servers running SQL Agent jobs, recording history in their own sysjobhistory tables. I am attempting to set up one centralized server with a job that collects the history from all the other servers, formats it a bit, and puts it into a table called AllJobHistory. As part of this process, I would like one column to indicate that multiple steps of a job were part of the same job run. They are already marked as being part of the same job through the job_id column, but I want to know that specific rows came from the 3:00 run of the job versus the 4:00 run. Being able to filter based on this column would make troubleshooting jobs that much easier for us, but I don’t see anything that links those steps together in any existing system table or DMV, is there?
My first attempt at rolling my own was to use the run_date, run_time, and run_duration columns. For each step, if I subtract the total run_duration up to this point from the run_time, it should bring me back to a time that is unique compared to all other runs of this job. This looked like it was working until I found a case where it wasn’t (likely because SQL Server is rounding run_time and run_duration at the precision of seconds). Here is my attempt at the query (with extra columns removed).
WITH JobDetails AS
(
SELECT
QUOTENAME(UPPER('ServerName')) AS [Server],
j.job_id AS [JobID],
j.name AS [JobName],
s.step_id AS [Step],
msdb.dbo.agent_datetime(run_date, run_time) AS [RunDate],
(run_duration/10000*3600 + (run_duration/100)%100*60 + run_duration%100) AS [RunDurationSeconds]
FROM msdb.dbo.sysjobhistory h
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = h.job_id
LEFT OUTER JOIN msdb.dbo.sysjobsteps s ON s.job_id = h.job_id AND s.step_id = h.step_id
WHERE h.step_id != 0
), GroupedDetails AS (
SELECT
jd.[Server],
jd.[JobID],
jd.JobName,
jd.Step,
jd.RunDate,
jd.RunDurationSeconds,
DATEADD(SECOND,
-ISNULL(SUM(jd.RunDurationSeconds) OVER
(PARTITION BY jd.JobName ORDER BY jd.JobName, jd.RunDate, jd.Step
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0),
jd.RunDate) AS grp
FROM JobDetails AS jd
)
SELECT
gd.[Server],
gd.JobName,
gd.Step,
gd.RunDate,
gd.RunDurationSeconds,
CONVERT(VARCHAR(36), gd.JobID) + '_' + FORMAT(gd.grp, 'yyyyMMdd_HHmmss') AS JobRunString
FROM GroupedDetails AS gd;
Here is an example where it worked as I wanted it to, for a job that has three steps. Note that the JobRunString matches for the first run and again for the second run.
And here is an example where it did not work as I wanted it to. Note that Step1RunDate + Step1RunDurationSeconds != Step2RunDate, causing a mismatch of JobRunString.
So, is there any reliable way of linking the steps of a job run together in sysjobhistory?
Best Answer
Note that
sysjobhistory
has an ID column (instance_id
). at least one entry should hit the table for each step of a job that was completed, followed by an entry withstep_id
= 0, recording the outcome of the job. Each step also records the time (run_date
andrun_time
) the step was started, which will be equal to or greater than the time the job was started. So, thestep_id
= 0 row for a given run has a higherinstance_id
than the related steps, but a lower (or equal) run time.So, try doing an initial pull of the data from the rows where
step_id
= 0 into a temp table (or equivalent). Then, all rows fromsysjobhistory
with the samejob_id
, a lowerinstance_id
, and a higher or equal start time (fromrun_date
andrun_time
) should belong to the job run you're looking for.I used something much like this for a failed job report I used to run, at a former employer.
Here's a stripped down, modified version of that code. It ran for me in a quick test on a SQL Server 2016 box, just now. However, I don't have any jobs that run frequently enough that multiple runs would have the same run times.
If you actually have a job that runs more that once a second, you might have to use a windowing function, to ensure that you don't pick up job steps from an earlier run of the job with the same run_time value.
Caveat: If the limit on the number of rows each job can have in
sysjobhistory
is hit, you may get odd/incomplete results. Also, I occasionally saw jobs that failed without generating a job outcome (usually a temporary failure to authenticate the Windows user running the job).