Sql-server – Linking job steps from a single run together in sysjobhistory

sql server

We have several SQL Servers running SQL Agent jobs, recording history in their own sysjobhistory tables. I am attempting to set up one centralized server with a job that collects the history from all the other servers, formats it a bit, and puts it into a table called AllJobHistory. As part of this process, I would like one column to indicate that multiple steps of a job were part of the same job run. They are already marked as being part of the same job through the job_id column, but I want to know that specific rows came from the 3:00 run of the job versus the 4:00 run. Being able to filter based on this column would make troubleshooting jobs that much easier for us, but I don’t see anything that links those steps together in any existing system table or DMV, is there?

My first attempt at rolling my own was to use the run_date, run_time, and run_duration columns. For each step, if I subtract the total run_duration up to this point from the run_time, it should bring me back to a time that is unique compared to all other runs of this job. This looked like it was working until I found a case where it wasn’t (likely because SQL Server is rounding run_time and run_duration at the precision of seconds). Here is my attempt at the query (with extra columns removed).

WITH JobDetails AS 
(
    SELECT 
        QUOTENAME(UPPER('ServerName')) AS [Server],
        j.job_id AS [JobID],
        j.name AS [JobName],
        s.step_id AS [Step],
        msdb.dbo.agent_datetime(run_date, run_time) AS [RunDate],
        (run_duration/10000*3600 + (run_duration/100)%100*60 + run_duration%100) AS [RunDurationSeconds]
    FROM msdb.dbo.sysjobhistory h
    INNER JOIN msdb.dbo.sysjobs j ON j.job_id = h.job_id
    LEFT OUTER JOIN msdb.dbo.sysjobsteps s ON s.job_id = h.job_id AND s.step_id = h.step_id 
    WHERE h.step_id != 0
), GroupedDetails AS (
    SELECT 
        jd.[Server],
        jd.[JobID],
        jd.JobName,
        jd.Step,
        jd.RunDate,
        jd.RunDurationSeconds,
        DATEADD(SECOND, 
            -ISNULL(SUM(jd.RunDurationSeconds) OVER
                (PARTITION BY jd.JobName ORDER BY jd.JobName, jd.RunDate, jd.Step 
                ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0), 
            jd.RunDate) AS grp
    FROM JobDetails AS  jd
)
SELECT 
    gd.[Server],
    gd.JobName,
    gd.Step,
    gd.RunDate,
    gd.RunDurationSeconds,
    CONVERT(VARCHAR(36), gd.JobID) + '_' + FORMAT(gd.grp, 'yyyyMMdd_HHmmss') AS JobRunString
FROM GroupedDetails AS gd;

Here is an example where it worked as I wanted it to, for a job that has three steps. Note that the JobRunString matches for the first run and again for the second run.
WorkingExample

And here is an example where it did not work as I wanted it to. Note that Step1RunDate + Step1RunDurationSeconds != Step2RunDate, causing a mismatch of JobRunString.
NotWorkingExample

So, is there any reliable way of linking the steps of a job run together in sysjobhistory?

Best Answer

Note that sysjobhistory has an ID column (instance_id). at least one entry should hit the table for each step of a job that was completed, followed by an entry with step_id = 0, recording the outcome of the job. Each step also records the time (run_date and run_time) the step was started, which will be equal to or greater than the time the job was started. So, the step_id = 0 row for a given run has a higher instance_id than the related steps, but a lower (or equal) run time.

So, try doing an initial pull of the data from the rows where step_id = 0 into a temp table (or equivalent). Then, all rows from sysjobhistory with the same job_id, a lower instance_id, and a higher or equal start time (from run_date and run_time) should belong to the job run you're looking for.

I used something much like this for a failed job report I used to run, at a former employer.

Here's a stripped down, modified version of that code. It ran for me in a quick test on a SQL Server 2016 box, just now. However, I don't have any jobs that run frequently enough that multiple runs would have the same run times.

USE msdb;

DECLARE @start_date varchar(8) = '20171001';

IF(OBJECT_ID('tempdb..#failhist') IS NOT NULL) DROP TABLE #failhist;

CREATE TABLE #failhist
(
    [job_name] [sysname] NOT NULL,
    [run_datetime] [datetime] NULL,
    [run_ended] [datetime] NULL,
    [instance_id] [int] NOT NULL,
    [job_id] [uniqueidentifier] NOT NULL,
    [run_date] [int] NOT NULL,
    [run_time] [int] NOT NULL
);


INSERT INTO #failhist
select *
  FROM
(SELECT j.name as job_name
       ,CONVERT(DATETIME, RTRIM(run_date)) + (run_time * 9 + run_time % 10000 * 6 + run_time % 100 * 10) / 216e4
        as run_datetime
       ,CONVERT(DATETIME, RTRIM(run_date)) + (run_time * 9 + run_time % 10000 * 6 + run_time % 100 * 10) / 216e4 + (run_duration * 9 + run_duration % 10000 * 6 + run_duration % 100 * 10) / 216e4
        as run_ended
       ,h.instance_id
       ,h.job_id
       ,h.run_date
       ,h.run_time
   from msdb..sysjobhistory h INNER JOIN msdb..sysjobs j ON h.job_id = j.job_id
  WHERE h.step_id = 0
    and h.run_date >= @start_date
) x
;

SELECT
       t.job_name AS [Job/Step]
      ,' ' + CONVERT(varchar(19), t.run_datetime, 121) + ' to ' + CONVERT(varchar(19), t.run_ended, 121) AS Run_Msg
      ,0 as HdrDetail
      ,t.job_name
      ,t.run_datetime
      ,t.instance_id
      ,-1 as step_id
      ,jh.instance_id as actual_instance_id
  FROM #failhist t
         INNER JOIN msdb..sysjobhistory jh ON (    t.job_id = jh.job_id
                                               AND t.instance_id >= jh.instance_id
                                               AND (   t.run_date < jh.run_date
                                                    OR (    t.run_date = jh.run_date
                                                        AND t.run_time <= jh.run_time
                                                       )
                                                   )
                                              )
 WHERE jh.step_id = 0
UNION ALL
SELECT
       '        '
      +CASE
         WHEN jh.step_id = 0
           THEN 'Job Summary'
         WHEN jh.run_status BETWEEN 0 AND 1  -- summary
           THEN RIGHT(CAST(100 + jh.step_id as varchar),2) + ' - ' + jh.step_name
         ELSE   '  ' + RIGHT(CAST(100 + jh.step_id as varchar),2) + ' (add''l info)'
       END
      ,'        ' + CAST(jh.message as varchar(max))
      ,1
      ,t.job_name
      ,t.run_datetime
      ,t.instance_id
      ,jh.step_id
      ,jh.instance_id
  FROM #failhist t
         INNER JOIN msdb..sysjobhistory jh ON (    t.job_id = jh.job_id
                                               AND t.instance_id >= jh.instance_id
                                               AND (   t.run_date < jh.run_date
                                                    OR (    t.run_date = jh.run_date
                                                        AND t.run_time <= jh.run_time
                                                       )
                                                   )
                                              )
 ORDER BY run_datetime, instance_id, step_id

If you actually have a job that runs more that once a second, you might have to use a windowing function, to ensure that you don't pick up job steps from an earlier run of the job with the same run_time value.

Caveat: If the limit on the number of rows each job can have in sysjobhistory is hit, you may get odd/incomplete results. Also, I occasionally saw jobs that failed without generating a job outcome (usually a temporary failure to authenticate the Windows user running the job).