Sql-server – query optimization: time intervals

performancequery-performancesql-server-2008sql-server-2008-r2

In the main, I've got two kinds of time intervals:

presence time and absence time

absence time can be of different types (eg breaks, absences, special day and so on) and time intervals may overlap and / or intersect.

It is not for sure, that only plausible combinations of intervals exist in raw data, eg. overlapping presence-intervals don't make sense, but may exist. I've tried to identify resulting presence-time intervals in many ways now – for me, the most comfortable seems to be the follwing one.

;with "timestamps"
as
(
    select
        "id" = row_number() over ( order by "empId", "timestamp", "opening", "type" )
        , "empId"
        , "timestamp"
        , "type"
        , "opening"
    from
    (
        select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
        ( select "empId", "starttime", "endtime", 1 as "type" from "worktime" ) as data
        unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
        union all
        select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
        ( select "empId", "starttime", "endtime", 2 as "type" from "break" ) as data
        unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
        union all
        select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
        ( select "empId", "starttime", "endtime", 3 as "type" from "absence" ) as data
        unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
    ) as data
)
select 
      T1."empId"
    , "starttime"   = T1."timestamp"
    , "endtime"     = T2."timestamp"
from 
    "timestamps" as T1
    left join "timestamps" as T2
        on T2."empId" = T1."empId"
        and T2."id" = T1."id" + 1
    left join "timestamps" as RS
        on RS."empId" = T2."empId"
        and RS."id" <= T1."id"      
group by
    T1."empId", T1."timestamp", T2."timestamp"
having
    (sum( power( 2, RS."type" ) * RS."opening" ) = 2)
order by 
    T1."empId", T1."timestamp";

see SQL-Fiddle for some demo data.

The raw data exist in different tables in form of "starttime" - "endtime" or "starttime" - "duration".

The idea was to get an ordered list of every timestamp with a "bitmasked" rolling sum of open intervals at each time to estimate presence time.

The fiddle works and gives estimated results, even if startimes of different intervals are equal. No indices are used in this example.

Is this the right way to achieve questioned task or is there a more elegant way for this?

If relevant for answering: amount of data will be up to several ten-thousand datasets per employee per table. sql-2012 is not available to calculate a rolling sum of predecessors inline in aggregate.


edit:

Just executed the query against larger amount of testdata ( 1000, 10.000, 100.000, 1 million) and can see that runtime increases exponentially. Obviously a warning flag, right?

I changed the query and removed aggregation of rolling sum by a quirky update.

I've added an auxiliary table:

create table timestamps
(
  "id" int
  , "empId" int
  , "timestamp" datetime
  , "type" int
  , "opening" int
  , "rolSum" int
)

create nonclustered index "idx" on "timestamps" ( "rolSum" ) include ( "id", "empId", "timestamp" )

and I moved calculating rolling sum to this place:

declare @rolSum int = 0
update "timestamps" set @rolSum = "rolSum" = @rolSum + power( 2, "type" ) * "opening" from "timestamps"

see SQL-Fiddle here

The runtime decreased to 3 sec regarding 1 million entries in the "worktime"-table.

Question stays the same: What's the most effective way to solve this?

Best Answer

I can't answer your question as to the absolutely best way. But I can offer a different way of solving the problem, which may or may not be better. It has a reasonably flat execution plan, and I think it will perform well. (I'm eager to know so share the results!)

I apologize for using my own syntax style instead of yours--it helps query wizardry come to me when everything lines up in its usual place.

The query is available in a SqlFiddle. I threw in an overlap for EmpID 1 just to be sure I had that covered. If you eventually find that overlaps cannot occur in presence data, then you can remove the final query and the Dense_Rank calculations.

WITH Points AS (
  SELECT DISTINCT
    T.EmpID,
    P.TimePoint
  FROM
    (
      SELECT * FROM dbo.WorkTime
      UNION SELECT * FROM dbo.BreakTime
      UNION SELECT * FROM dbo.Absence
    ) T
    CROSS APPLY (VALUES (StartTime), (EndTime)) P (TimePoint)
), Groups AS (
  SELECT
    P.EmpID,
    P.TimePoint,
    Grp =
      Row_Number()
      OVER (PARTITION BY P.EmpID ORDER BY P.TimePoint, X.Which) / 2
  FROM
    Points P
    CROSS JOIN (VALUES (1), (2)) X (Which)
), Ranges AS (
  SELECT
    G.EmpID,
    StartTime = Min(G.TimePoint),
    EndTime = Max(G.TimePoint)
  FROM Groups G
  GROUP BY
    G.EmpID,
    G.Grp
  HAVING Count(*) = 2
), Presences AS (
  SELECT
    R.*,
    P.Present,
    Grp =
       Dense_Rank() OVER (PARTITION BY R.EmpID ORDER BY R.StartTime)
       - Dense_Rank() OVER (PARTITION BY R.EmpID, P.Present ORDER BY R.StartTime)
  FROM
    Ranges R
    CROSS APPLY (
      SELECT
        CASE WHEN EXISTS (
          SELECT *
          FROM dbo.WorkTime W
          WHERE
            R.EmpID = W.EmpID
            AND R.StartTime < W.EndTime
            AND W.StartTime < R.EndTime
        ) AND NOT EXISTS (
          SELECT *
          FROM dbo.BreakTime B
          WHERE
            R.EmpID = B.EmpID
            AND R.StartTime < B.EndTime
            AND B.StartTime < R.EndTime
        ) AND NOT EXISTS (
          SELECT *
          FROM dbo.Absence A
          WHERE
            R.EmpID = A.EmpID
            AND R.StartTime < A.EndTime
            AND A.StartTime < R.EndTime
        ) THEN 1 ELSE 0 END
    ) P (Present)
)
SELECT
  EmpID,
  StartTime = Min(StartTime),
  EndTime = Max(EndTime)
FROM Presences
WHERE Present = 1
GROUP BY
  EmpID,
  Grp
ORDER BY
  EmpID,
  StartTime;

Note: performance of this query would be improved you combined the three tables and added a column to indicate what kind of time it was: work, break, or absence.

And why all the CTEs, you ask? Because each one is forced by what I need to do to the data. There's an aggregate, or I need to put a WHERE condition on a windowing function or use it in a clause where windowing functions aren't allowed.

Now I'm going to go off and see if I can't think up another strategy for accomplishing this. :)

For amusement I include here a "diagram" I made to help solve the problem:

------------
   -----------------
                ---------------
                           -----------

    ---    ------   ------       ------------

----   ----      ---      -------

The three sets of dashes (separated by spaces) represent, in order: presence data, absence data, and the desired result.